Selective requests for authentication for voice-based launching of applications

ABSTRACT

Systems, methods, and computer-readable media are disclosed for systems and methods for selective requests for authentication for voice-based launching of applications. Example methods may include receiving first audio data representing an utterance, determining that the device is in a first operating mode when the audio data was received, determining that the device is in a locked state when the audio data was received, and receiving, from a remote system, a command to display information based at least in part on the audio data. Certain methods may include receiving an indication that the utterance was spoken by a user authorized to access the information while in the first operating mode and the locked state, and causing presentation of the information by the device.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 15/921,263, filed Mar. 14, 2018, which is incorporated herein by reference in its entirety.

BACKGROUND

Electronic devices, such as smartphones, tablets, computers, and so forth, may be used by users to consume digital content, play games, request information, and the like. Users may interact with devices via controls, touch inputs, and, in some instances, voice commands. Users may desire interacting with such devices in different manners at different times. However, changing an interaction mode of a device may be cumbersome or inconvenient.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth with reference to the accompanying drawings. The drawings are provided for purposes of illustration only and merely depict example embodiments of the disclosure. The drawings are provided to facilitate understanding of the disclosure and shall not be deemed to limit the breadth, scope, or applicability of the disclosure. In the drawings, the left-most digit(s) of a reference numeral may identify the drawing in which the reference numeral first appears. The use of the same reference numerals indicates similar, but not necessarily the same or identical components. However, different reference numerals may be used to identify similar components as well. Various embodiments may utilize elements or components other than those illustrated in the drawings, and some elements and/or components may not be present in various embodiments. The use of singular terminology to describe a component or element may, depending on the context, encompass a plural number of such components or elements and vice versa.

FIG. 1 is a schematic illustration of an example use case for voice-forward graphical user interface mode management in accordance with one or more example embodiments of the disclosure.

FIG. 2 is a schematic illustration of an example process flow for voice-forward graphical user interface mode management in accordance with one or more example embodiments of the disclosure.

FIG. 3 is a schematic illustration of example user interfaces for various device operation modes in accordance with one or more example embodiments of the disclosure.

FIG. 4 is a schematic illustration of an example use case for voice-forward enablement of different versions of an application in accordance with one or more example embodiments of the disclosure.

FIG. 5 is a schematic illustration of an example process flow for voice-forward enablement of different versions of an application in accordance with one or more example embodiments of the disclosure.

FIG. 6 is a schematic illustration of example user interfaces for various device operation modes and corresponding versions of applications in accordance with one or more example embodiments of the disclosure.

FIG. 7 is a schematic illustration of an example process flow for voice-forward changes to device operation modes in accordance with one or more example embodiments of the disclosure.

FIG. 8 is a schematic illustration of an example use case for voice-forward changes to device operation modes in accordance with one or more example embodiments of the disclosure.

FIG. 9 is a schematic illustration of example user interfaces of an application in different device operation modes in accordance with one or more example embodiments of the disclosure.

FIG. 10 is a schematic illustration of an example process flow for selective requests for passwords for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure.

FIG. 11 is a schematic illustration of an example process flow for selective requests for authentication for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure.

FIGS. 12-13 are schematic illustrations of example use cases for selective requests for passwords for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure.

FIG. 14 is a schematic block diagram of an illustrative device in accordance with one or more example embodiments of the disclosure.

FIG. 15 is a schematic block diagram of components of a system in accordance with one or more example embodiments of the disclosure.

FIG. 16 is a system flow diagram illustrating user recognition in accordance with one or more example embodiments of the disclosure.

FIGS. 17-18 are schematic diagrams of how natural language processing may be performed in accordance with one or more example embodiments of the disclosure.

FIG. 19 illustrates data stored and associated with profiles according to embodiments of the present disclosure.

DETAILED DESCRIPTION

Overview

Electronic devices, such as tablets, smartphones, computers, and the like may be configured to operate in different device operation modes. Device operation modes may include, for example, touch-forward operation modes, voice-forward operation modes, user-presence based operation modes, and other operation modes. Operation modes may dictate or be associated with visual displays and/or user interfaces that are presented at a device. For example, a tablet operation mode may present an application user interface or another user interface, such as a home screen or operating system interface, in a first configuration, and a voice-forward operation mode may present the application user interface or other user interface in a second configuration that is different than the first configuration. Differences between user interfaces presented in different operation modes may include differences in content layout or arrangement, differences in an amount of content presented or a content and/or information density, differences in available options that are presented at the device, and the like. For example, a user interface presented in a touch-forward operation mode may have more information or content, or a higher information or content density, than a user interface presented in a voice-forward operation mode.

A touch-forward or touch-based operation mode of a device may be a mode where the user experience with the device is optimized for touch input, or is touch-forward. As a result, this may indicate that the user is physically in proximity to the device, so as to provide touch inputs. The touch-forward operation mode may therefore use smaller fonts, include more options or selections, present more information or have a relatively higher content density than other modes, and so forth, since the user may be physically close to the device (e.g., holding the device, etc.). This may be because the user is physically closer to the device, and can more easily consume information and/or provide inputs. Voice inputs may complement touch-forward operation modes. In touch-forward operation mode, voice commands can still be used to interact with the device (e.g., “play that one”). In touch-forward operation mode, an overall experience may be optimized for touch inputs.

In contrast, a voice-forward or voice-based operation mode of a device may be a mode where the user experience with the device is optimized for voice input, or is voice-forward. As a result, the user may be able to interact with the device from a greater distance, since the user may not have to touch the device in order to make an input at the device. For example, a user in a kitchen environment may have wet hands and may not want to touch the device, and may therefore interact with the device via voice. In addition, because the user may be further away from the device, user interface fonts may be relatively larger, and any selectable elements, if any, may be fewer in number than in a touch-forward operation mode. The content density may be reduced, so as to improve readability or consumption from a greater distance than, for example, content presented in the touch-forward operation mode. The voice-forward operation mode may therefore be different from the touch-forward operation mode and may encourage voice-forward interactions with the device, such as by providing voice hints (e.g., “say show me the recipe,” etc.). Touch inputs may complement voice-forward operation modes. In voice-forward operation mode, touch inputs can still be used to control the device. In voice-forward operation mode, an overall experience may be optimized for voice inputs.
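
By way of a non-limiting illustration only, and not as part of any claimed embodiment, the contrast between the two operation modes can be sketched as two user interface configurations that differ in font size, content density, and the interaction hints surfaced to the user. The names in the following Python sketch (e.g., UiModeConfig, TOUCH_FORWARD, VOICE_FORWARD) are hypothetical and are not drawn from any particular implementation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class UiModeConfig:
        """Hypothetical per-mode user interface configuration."""
        name: str
        base_font_px: int          # larger fonts for far-field viewing
        max_items_on_screen: int   # rough proxy for content/information density
        show_voice_hints: bool     # e.g., 'say "show me the recipe"'
        primary_input: str         # "touch" or "voice"

    # A touch-forward mode assumes the user is holding or near the device.
    TOUCH_FORWARD = UiModeConfig(
        name="touch-forward",
        base_font_px=14,
        max_items_on_screen=20,
        show_voice_hints=False,
        primary_input="touch",
    )

    # A voice-forward mode assumes the user may be across the room.
    VOICE_FORWARD = UiModeConfig(
        name="voice-forward",
        base_font_px=32,
        max_items_on_screen=6,
        show_voice_hints=True,
        primary_input="voice",
    )

A device switching between the modes could simply swap which configuration drives layout, density, and hinting.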

Embodiments of the disclosure include systems and methods for application-based device operation mode management and/or voice-forward graphical user interface mode management. Certain embodiments may use application settings and/or device settings to manage changes to the operating mode of a device. As a result, user experiences with the device may be improved by avoiding interruption to content that a user may be consuming, and automatically shifting device operation modes based on likely user interactions with the device. Some embodiments may be configured to change device operation modes based on voice inputs or voice commands, and may be configured to change operation modes based on applications that are opened responsive to voice commands. In some embodiments, password protections on a device may be bypassed as a result of a voice input from a speaker or user that is likely to be authorized to access the device. As a result, a user can interact with the device using voice, rather than having to physically approach the device and input a password.

This disclosure relates to, among other things, devices, systems, methods, computer-readable media, techniques, and methodologies for voice-forward graphical user interface mode management. Certain embodiments manage changes to operation modes of a device, for example, by deferring an operation mode change, canceling an operation mode change, overriding an operation mode change, and other management. Some embodiments may use application settings, such as settings that prevent a computer processor or device display from sleeping, to determine whether to implement a device operation mode change. Some embodiments may determine whether a device operation mode is to be implemented based on availability of different versions of application interfaces (e.g., voice-forward versions, touch-forward versions, etc.) and/or the availability of a related remote application (e.g., skills that can be enabled at remote servers, etc.). Certain embodiments may use voice data or audio data to determine whether to prompt a user for a password before presenting certain information to the user at a device. Certain embodiments may determine content being presented at a device, as well as a current function of a device, in order to determine whether to switch a device operation mode.

Some embodiments of the disclosure may leverage a wake-lock or other application setting of an application or operating system to determine whether a device operating mode or operation mode is to be modified. For example, a tablet may be configured to operate in a touch-forward operation mode and a voice-forward operation mode. Handling changes in mode may be based on an application that is running at the time a mode change is to occur. For example, when the tablet is docked, a mode change from the touch-forward mode to the voice-forward mode may normally occur, but because of a specific application running at the time of docking, the mode change may be deferred or canceled, so as to reduce interruptions to application-related content that a user may be consuming.

Certain embodiments may use voice-forward commands to change device operation modes without physical user interaction with the device. As a result, users can change the configuration and density of content presented at a device from a distance, and may not have to undock the device or manually change a device operation mode. Certain embodiments may manage when passwords or codes are needed to access certain data on devices. For example, a user may be interacting with a tablet or other device using voice commands, and may request that a certain application, such as a calendar, be opened. Embodiments may determine whether the user will have to enter a passcode at the device, for example, based on a likelihood that the user is authorized to access the device or application, the type of application or data being requested, and/or the like. The user experience with the device may therefore be improved by not requiring a user to enter a passcode to access data or content the user is authorized to access.

Referring to FIG. 1, an example use case 100 for voice-forward graphical user interface mode management is depicted in accordance with one or more example embodiments of the disclosure. In FIG. 1, at a first instance 110, a user may be using a device 112 to consume video content. The video content may be optionally presented in a full screen view or format. For example, the user may be consuming a movie, television show, or other digital content. The device 112 may be operating in a touch-forward operation mode at the first instance 110, as the user is holding the device and/or may have navigated to the video using touch-forward inputs.

The user may place the device 112 at a docking station 114 or other accessory device, such as a charging device to charge the device, for example, or to set the device in an upright position so that the device no longer needs to be held by the user. Accessory devices may include non-power providing devices, such as device stands or cradles.

At a second instance 120, the device 112 may be docked at the docking station 114. When the device 112 is connected to the docking station 114 under usual circumstances, the device 112 may automatically convert or switch from the touch-forward operation mode to a voice-forward operation mode, so that the user does not have to remain in physical proximity to the device while it is docked and/or charging. In a docked operation mode, which may be a voice-forward operation mode in some embodiments, the device may be in an always on, always listening, and always powered configuration.

However, because the user was consuming video content in a full screen mode, the device 112 may postpone or defer the automatic change in operation status that was supposed to occur when connecting to the docking station 114. This is because the user may still be consuming the video content, and may not desire to be interrupted with a change to the device operation mode. In some instances, video playback may be interrupted when connected to an accessory device or a docking station unless the video is being played in a full screen mode, so as to avoid preventing a change in operation mode as a result of inline advertisement videos or other videos being presented at the device. In instances where audio content is being presented, such as music, a device operation mode may be changed, but playback of the audio content may continue uninterrupted, so as to avoid negatively impacting a user experience of the device. In other instances, the device operation mode may not be changed while audio is being presented in a background environment.

After the device 112 is docked at the docking station 114 and the video content completes playback, the device 112 may return to the application interface for the application that was used to present the video content, as illustrated at the second instance 120.

In some embodiments, an optional timeout period may be determined to elapse after the application interface is presented and/or after playback of the video content is complete. The device 112 may remain in the touch-forward operation mode (or whatever the previous operation mode was) during the timeout period so as to allow the user to interact with the application using touch-forward inputs.

After the timeout period has elapsed, the device 112 may change to a voice-forward operation mode, as illustrated at a third instance 130. The device 112 may therefore present a user interface associated with the voice-forward operation mode. For example, the user interface may include voice hints, user-specific information, and/or other content. In some embodiments, the user interface may include an ambient clock and/or other content.

To manage changes to the operation mode of the device 112, an example process flow 140 is presented and may be performed, for example, by one or more remote servers or at a device locally. The remote server and/or device may include at least one memory that stores computer-executable instructions and at least one processor configured to access the at least one memory and execute the computer-executable instructions to perform various actions or operations, such as one or more of the operations in the process flow 140 of FIG. 1.

At a first block 150, it may be determined that a device is docked at a docking station. The device 112, for example, may determine that a connector of the docking station 114 or the device 112 is coupled to a connector or connector port of a connected device. A coupling can include a connection or any other means by which devices are physically and/or communicatively coupled. The connected device may be identified as the docking station 114. The device 112 or a remote server may determine, for example using a settings database, that a connection to the docking station 114 causes an active user interface theme of the device to be set to a voice-forward user interface theme or operation mode. The voice-forward user interface may present digital content at the display in a visual format or application interface having a first content density. The content density may be relatively less than a content density of user interfaces configured for touch-forward operation modes, because the user may be consuming the content from a greater distance.
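
As a non-limiting sketch of the settings-database lookup described above, the following Python fragment maps a hypothetical accessory identifier to the user interface theme it activates. The table contents and function names are illustrative assumptions, not a definitive implementation.

    # Hypothetical settings table mapping an accessory identifier to the user
    # interface theme (operation mode) it activates; None means no change.
    ACCESSORY_THEME_SETTINGS = {
        "docking_station": "voice-forward",
        "wall_charger": None,
        "charging_mat": None,
    }

    def theme_for_accessory(accessory_id: str) -> str | None:
        """Look up the UI theme an accessory activates, if any."""
        return ACCESSORY_THEME_SETTINGS.get(accessory_id)

    def on_accessory_connected(accessory_id: str, current_theme: str) -> str:
        """Decide the active UI theme after the device couples to an accessory."""
        new_theme = theme_for_accessory(accessory_id)
        return current_theme if new_theme is None else new_theme

    if __name__ == "__main__":
        # Docking switches the theme; an ordinary charger leaves it alone.
        print(on_accessory_connected("docking_station", "touch-forward"))
        print(on_accessory_connected("wall_charger", "touch-forward"))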

In some embodiments, the device may be physically coupled to an accessory, such as a case or a bumper, which is used to interface with the accessory device and/or docking station. For example, the accessory may be coupled to an input/output and/or charging port of the charging device. The accessory may optionally include circuitry and/or an input/output or charging port that couples with the docking station. In other embodiments, the device may not be docked at a docking station, but could be coupled to any power-providing or other accessory device, such as a power cord, charging mat, and the like.

Certain embodiments may not need to be coupled to charging devices, and may instead have operation modes that are associated with certain orientations or positions of the device. For example, if it is determined that the device is leaning against a stand or a wall, or another accessory device, the device may switch operation modes. In some embodiments, if it is determined (e.g., via feedback from one or more accelerometers, gyroscopes, and/or other sensors, etc.) that the device is in a certain position, a device operation mode may be changed. For example, leaning the device against a wall, lamp, or other structure, as determined by one or more motion sensors for a certain length of time, may cause the device operation mode to be changed. In such instances, the accessory device may not be identified or detected, and the device may change operation modes based at least in part on the device's sensors determining that the device is tilted at a certain angle and/or is at a certain angle for a certain length of time. The device may determine that the device is generally positioned physically in real space such that it can be viewed by a user, and may optionally determine that the device has not moved in a certain period of time (e.g., 30 seconds, etc.), and, in response, may switch into a voice-forward operating mode.
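
For illustration only, the sensor-based variant can be sketched as a simple predicate over a tilt angle and a stillness duration. The thresholds below are assumptions (the 30-second dwell mirrors the example above), and the function name is hypothetical.

    import time

    TILT_RANGE_DEGREES = (30.0, 80.0)   # assumed "propped up and viewable" range
    STILLNESS_SECONDS = 30.0            # example dwell time noted above

    def should_switch_to_voice_forward(tilt_deg: float,
                                       last_motion_ts: float,
                                       now: float | None = None) -> bool:
        """Return True when the device appears propped up and has not moved.

        tilt_deg: fused angle from accelerometer/gyroscope readings (assumed).
        last_motion_ts: timestamp of the most recently detected motion.
        """
        now = time.time() if now is None else now
        propped_up = TILT_RANGE_DEGREES[0] <= tilt_deg <= TILT_RANGE_DEGREES[1]
        still_long_enough = (now - last_motion_ts) >= STILLNESS_SECONDS
        return propped_up and still_long_enough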

At block 160, a determination may be made that the docking station is associated with a voice-forward operating mode. For example, docking stations determined to include one or more microphones and one or more speakers may be associated with a voice-forward operation mode, or may automatically cause the device to change to a voice-forward operation mode. In some embodiments, the device may determine that a connection to the docking station causes activation of a voice-forward operating mode at the device. In an example, the device 112 may determine that a touch-forward operating mode is active at the device, such as at the first instance 110, and the touch-forward operating mode may present digital content in a visual format having a second content density that is greater than the first content density of the voice-forward operation mode.

At block 170, a determination may be made that an active application is preventing the device from sleeping. For example, the device 112 may determine that the active video playback application is preventing the device 112 from sleeping because of the content playback. To keep the device 112 from sleeping, the application may activate a wake-lock or other application setting or operating system setting. The device may determine that at least one component of the device (e.g., at least one processor, a display, a sensor, etc.) is in a stay awake system state, where the stay awake system state prevents one or more components of the device from entering a sleep state. For example, the sleep state may be prevented during playback of the video content. In some embodiments, a different component of the device may be in a stay awake system state, such as a display, a location (e.g., GPS, etc.) component, a motion sensor (e.g., accelerometer, gyroscope, etc.) component, a communications component, etc.

At optional block 180, it may be determined that the active application is a touch-forward application. For example, the device 112 may determine that the active video playback application is a touch-forward application, and that, therefore, the device 112 is in a touch-forward operation mode.

At block 190, changing the device operating mode to the voice-forward operating mode may be delayed or deferred. For example, because the device 112 is being used to consume the touch-forward content and/or is in the stay awake state, the automatic change in operation mode may be temporarily deferred or canceled so as to avoid interrupting the user's consumption. The device 112 may determine that activation of the voice-forward operating mode is to be deferred while the at least one processor is in the stay awake system state. The device 112 may periodically check or query the processor to determine if the processor has been released from the stay awake state. In some embodiments, the application may send a notification or provide an indication that the processor has been released from the stay awake state or an on state. After determining that the at least one processor has been released from the stay awake system state after completion of playback of the video content, the device operation mode may be changed to the voice-forward operation mode, as illustrated at the third instance 130. The device 112 may, in some embodiments, monitor for an audio signal representative of a wake word spoken by a user in the voice-forward mode, or may otherwise listen for a voice input.
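
A minimal sketch of the deferral described at block 190 follows, assuming a callable that reports whether a stay-awake hold is still in place and a callable that performs the mode switch. Both names and the polling approach are illustrative; as noted above, a notification from the application could replace the polling.

    import time

    def activate_voice_forward_after_release(is_held_awake,
                                             switch_mode,
                                             poll_interval_s: float = 1.0,
                                             timeout_s: float | None = None) -> bool:
        """Defer a pending mode change until the stay-awake hold is released.

        is_held_awake: callable returning True while an application setting
            (e.g., a wake-lock style hold) keeps a component awake.
        switch_mode: callable that performs the deferred mode change.
        Returns True if the change was applied, False if it was canceled.
        """
        start = time.monotonic()
        while is_held_awake():
            if timeout_s is not None and time.monotonic() - start > timeout_s:
                return False               # cancel the pending mode change
            time.sleep(poll_interval_s)    # periodically re-check the hold
        switch_mode()                      # hold released; apply the change
        return True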

Embodiments of the disclosure may include voice-forward graphical user interface mode management, voice-forward management of device operation modes, and selective requests for passwords in voice-forward operation modes. Certain embodiments may determine when device operation mode changes are to be implemented, when operation mode changes are to be deferred or canceled, and when operation modes are to be automatically implemented. Certain embodiments may use voice-forward commands or inputs to cause changes to device operation modes, and certain embodiments may determine whether passwords or other authentication is needed to access information at a device.

Example embodiments of the disclosure provide a number of technical features or technical effects. For example, in accordance with example embodiments of the disclosure, certain embodiments of the disclosure may change device operation modes based at least in part on voice commands, determine whether passwords are needed to access information, identify speakers or users using audio data, such as voice data, automatically download applications or enable remote skills, and present information in various operation mode user interfaces. Certain embodiments may enable different operation modes that may have different user interfaces responsive to connections to certain accessories, voice inputs, coupling to accessories, and other inputs. As a result of improved functionality, device operation mode experiences may be bridged across various operation modes, including touch-forward operation modes and voice-forward operation modes. Embodiments of the disclosure may improve computing efficiency and bandwidth by managing device operation modes and increasing the number of manners of inputs at devices. The above examples of technical features and/or technical effects of example embodiments of the disclosure are merely illustrative and not exhaustive.

One or more illustrative embodiments of the disclosure have been described above. The above-described embodiments are merely illustrative of the scope of this disclosure and are not intended to be limiting in any way. Accordingly, variations, modifications, and equivalents of embodiments disclosed herein are also within the scope of this disclosure. The above-described embodiments and additional and/or alternative embodiments of the disclosure will be described in detail hereinafter through reference to the accompanying drawings.

Illustrative Process and Use Cases

FIG. 2 depicts an example process flow 200 for voice-forward graphical user interface mode management in accordance with one or more example embodiments of the disclosure. While example embodiments of the disclosure may be described in the context of touch-forward and voice-forward operation modes, it should be appreciated that the disclosure is more broadly applicable to any operation mode functionality. Some or all of the blocks of the process flows in this disclosure may be performed in a distributed manner across any number of devices. The operations of the process flow 200 may be optional and may be performed in a different order.

At block 210 of the process flow 200, computer-executable instructions stored on a memory of a device, such as a remote server or a user device, may be executed to determine that the device is connected to a charging device or an accessory device. For example, a tablet or other electronic device may have a connector configured to engage a charging device, such as a wall charger, external battery, docking station, etc., or the device may have a connector port configured to receive a connector of a charging device. The connector and/or connector port may be removably connected to the device. For example, the connector or connector port may be an accessory coupled to the device.

At optional block 220 of the process flow 200, computer-executable instructions stored on a memory of a device, such as a remote server or a user device, may be executed to determine that the device is to switch to a device operating mode or activate a certain operating mode associated with the accessory device or charging device as a result of being coupled to the charging device or other accessory device. For example, in some instances, a handshake protocol or exchange between the device and the charging device or accessory device may be used to determine whether the device is to activate a certain operating mode. In some embodiments, determining that a device is connected to a charging device or other accessory device may include identifying a connected device as the charging device, where a connection to the charging device causes an automatic change in the device operating mode from a first operating mode to a second operating mode, unless a component of the device, such as one or more computer processors or a display, is held in an awake state by an application.

At block 230 of the process flow 200, computer-executable instructions stored on a memory of a device may be executed to determine that the device is to change a device operating mode from a first operating mode to a second operating mode. For example, a connection to a certain type of charging device, or a specific charging device (e.g., as determined by a charging device identifier, etc.), may usually cause automatic changes to a device operating mode of the device, with certain exceptions in some embodiments. For example, a charging device may be associated with an operation mode of a voice-forward operation mode. Connecting the device to that charging device may cause the operation mode of the device to be automatically changed or switched to the voice-forward operation mode. In an example, a docking station may be associated with a second operation mode of a voice-forward operation mode, where the user interface presented at the display is voice-forward or encourages users to interact with the device via voice input (e.g., a relatively lower number of selectable options, presenting voice hints, etc.). A device may be operating in a first operating mode of a touch-forward operation mode prior to being connected to the docking station charging device. When the device is connected to the docking station, the device may change the device operation mode from the touch-forward operation mode to the voice-forward operation mode. This may be because a user may interact with the device from increased distances while the device is charging and/or docked at the docking station. In some embodiments, the operation mode of the device may always be changed based at least in part on the type of connected device and/or charging device. In other embodiments, the operation mode of the device may be changed unless there is an exception or other rule. In such instances, changes to operation modes may be deferred or canceled. If the device is connected to a normal charging device, the first operating mode may persist while the device is connected to the normal charging device (e.g., no mode change may occur, etc.).

At block 240 of the process flow 200, computer-executable instructions stored on a memory of a device may be executed to determine that an application setting of an application, or an application state, executing on the device is causing the one or more computer processors to remain in an awake state. For example, one or more applications or computer programs may be executing on the device. The respective applications may have one or more application settings. The application settings may be settings that relate to operation of the device. For example, the application may have a wake-lock setting that causes one or more components to remain in an “awake” or always on state, as opposed to a hibernate, standby, sleep, off, or other state. The application setting may therefore prevent the computer processor, display, or other component of the device from sleeping while work is being done or the component is being used by the application, for example. In some embodiments, application settings may include a screen-on or display-on setting that causes a display of the device to remain in an illuminated mode. In some instances, displays may remain on as a result of an additional setting (e.g., a screen-on setting, etc.), or as a result of a wake-lock setting. Once the application has completed its work, the application setting may be modified or changed so as to remove the hold on the component, such as the processor(s), that are being held in the awake state. Specifically, a first value associated with the application setting may be replaced with a second value associated with the application setting. In some embodiments, computer-executable instructions stored on a memory of a device may be executed to determine that an application setting of an application that is executing or otherwise active on the device is causing the one or more computer processors, or another component of the device, such as a display, to remain in an awake state. This determination may be made by querying active applications, or by determining whether the computer processors are in a wake-lock or awake state. If so, the device may determine that the computer processors are in the wake-lock or awake state as a result of some application setting, and the application causing the wake-lock may not be identified. In some embodiments, application settings of an active application (or an application executing in a foreground of a computing environment, etc.) may be checked to determine whether a certain application setting is active or selected. The automatic change to device operation mode may be deferred while the application setting is active or remains in the same state. In some instances, the automatic change to device operation mode may be canceled if a timeout period elapses without a change to the application setting.
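
As a non-limiting illustration of the check described at block 240, the following sketch inspects hypothetical per-application settings to decide whether the automatic mode change should be deferred. The AppState fields are assumptions standing in for wake-lock and screen-on style settings.

    from dataclasses import dataclass

    @dataclass
    class AppState:
        """Hypothetical snapshot of an application's mode-relevant settings."""
        name: str
        foreground: bool
        wake_lock_held: bool                # a "stay awake" hold on a component
        screen_on_requested: bool = False   # a screen-on / display-on setting

    def mode_change_should_be_deferred(active_apps: list[AppState]) -> bool:
        """Defer the automatic operating-mode change while any foreground
        application holds a device component in an awake state."""
        return any(
            app.foreground and (app.wake_lock_held or app.screen_on_requested)
            for app in active_apps
        )

    if __name__ == "__main__":
        apps = [AppState("video_player", foreground=True, wake_lock_held=True)]
        print(mode_change_should_be_deferred(apps))   # True -> defer the change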

At block 250, computer-executable instructions stored on a memory of a device may be executed to determine that the application setting, or the application state, has been modified. For example, the application setting may be deactivated or changed to a different setting. In one instance, a wake-lock or stay awake setting may be modified or turned off. As a result, the computer processors may no longer be held in an awake state. The application may release resources back to the device. The state or status of the computer processors or other component that is being held awake may be periodically checked, or the application setting may be queried, so as to determine that the application setting has been modified.

At block 260, computer-executable instructions stored on a memory of a device may be executed to cause the device operating mode to be changed from the first operating mode to the second operating mode. For example, once the application setting is modified, the automatic change to the device operation mode as a result of connecting to the charging device may be implemented, and the device operation mode may be changed from the first operation mode to the second operation mode.

At optional block 270, computer-executable instructions stored on a memory of a device may be executed to present a user interface associated with the second operating mode instead of an application interface of the application. For example, the second operating mode may be associated with a different user interface layout or home screen than the first operating mode. In some embodiments, when the device operation mode is changed, an application interface or user interface that was being presented in the previous operation mode may be replaced with the user interface associated with the new operation mode. For example, a user may have been watching a video on Netflix in the first operation mode, and after the change to the second operation mode, the Netflix application interface may be replaced by a user interface or home screen associated with the second operation mode. In some embodiments, the Netflix application interface may be replaced with a Netflix interface that is reformatted for the second operation mode.

In some embodiments, a timeout period may follow the change to the application setting before a change to the operation mode is implemented. For example, the device may determine that a timeout period has elapsed after the change to the application setting. The device may remain in the touch-forward operating mode and/or may present an application interface of the application in the first operating mode after the application setting has been modified during the timeout period. In some instances, if the application setting is not changed within a certain length of time (e.g., a mode change cancelation length of time after which pending mode changes are canceled, etc.) after being connected to the docking station, the change to the device operation mode may be canceled. For example, if it is determined that a mode change cancelation length of time has elapsed, the device may cancel a change to, or a scheduled change to, the device operating mode.
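
For illustration only, the timeout and cancelation behavior can be sketched as a small state decision, assuming illustrative values for the post-release grace period and the mode change cancelation length of time; all names and durations below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class PendingModeChange:
        """Hypothetical record of a mode change scheduled when the device docks."""
        requested_at: float                       # when the device was docked
        setting_cleared_at: float | None = None   # when the wake hold was released

    GRACE_PERIOD_S = 10.0         # timeout period after the setting is modified
    CANCELATION_WINDOW_S = 600.0  # mode change cancelation length of time

    def resolve_pending_change(change: PendingModeChange, now: float) -> str:
        """Return 'apply', 'wait', or 'cancel' for a deferred mode change."""
        if change.setting_cleared_at is None:
            if now - change.requested_at > CANCELATION_WINDOW_S:
                return "cancel"   # setting never changed in time; cancel the switch
            return "wait"         # application setting still active; keep deferring
        if now - change.setting_cleared_at < GRACE_PERIOD_S:
            return "wait"         # let the user keep touch-forward input briefly
        return "apply"            # grace period elapsed; switch operation modes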

FIG. 3 depicts example user interfaces 300 for various device operation modes in accordance with one or more example embodiments of the disclosure. In the example of FIG. 3, a device 310 may be operating in a touch-forward operation mode at a first instance 320. For example, a user of the device 310 may be holding the device and interacting via touch input.

At a second instance 330, the device 310 may be connected to a docking station 340. When connected to the docking station 340, the device 310 may automatically convert or change from the touch-forward operation mode to a voice-forward operation mode. The device 310 may present a user interface or home screen associated with the voice-forward operation mode when in the voice-forward operation mode and/or while connected to the docking station 340.

As illustrated at a third instance 350, if the user disconnects the device 310 from the docking station 340, the device 310 may automatically return to the previous operation mode, or the touch-forward operation mode. In some embodiments, as illustrated, the device 310 may present a home screen or user interface associated with the touch-forward operation mode when disconnected from the docking station 340, while in other embodiments, the device 310 may return to a previously opened application that was open when the device 310 was connected to the docking station 340. For example, when connected to the docking station 340, the device 310 may present a user interface associated with the voice-forward operation mode, and after determining that the device 310 is disconnected from the charging device or docking station 340, an application interface of the application in the touch-forward operation mode may be presented.

In instances where disconnecting from the docking station 340 causes the device 310 to return to a previously opened application, the device 310 may determine a first application user interface that is presented prior to the connection to the docking station 340, present a voice-forward operating mode user interface after the voice-forward operating mode is activated, determine that the docking station 340 is disconnected, and again present the first application user interface in the touch-forward operating mode. In some embodiments, when disconnected from the docking station 340, or when returning to a touch-forward operation mode, the device 310 may cease monitoring for an audio signal or wake word.
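
A compact sketch of remembering the pre-dock interface and restoring it on undock follows; the device facade and its methods (current_interface, set_mode, show, and the wake word monitoring calls) are hypothetical placeholders rather than a real device API.

    class DockModeManager:
        """Minimal sketch of restoring the pre-dock interface on undock.
        The `device` facade and all of its methods are hypothetical."""

        def __init__(self, device):
            self.device = device
            self._previous_interface = None

        def on_docked(self):
            # Remember what was on screen, then switch to the voice-forward UI.
            self._previous_interface = self.device.current_interface()
            self.device.set_mode("voice-forward")
            self.device.start_wake_word_monitoring()

        def on_undocked(self):
            # Stop listening for the wake word and restore the earlier UI.
            self.device.stop_wake_word_monitoring()
            self.device.set_mode("touch-forward")
            if self._previous_interface is not None:
                self.device.show(self._previous_interface)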

FIG. 4 is a schematic illustration of an example use case 400 for voice-forward enablement of different versions of an application in accordance with one or more example embodiments of the disclosure.

At a first instance 410, a device 412 may be docked or connected to a docking station 414. The device 412 may be in a voice-forward operation mode. While the device 412 is docked, a user may say a voice input of “Alexa, can you open my photo app?” The device 412 may determine that the voice input is indicative of a request to open an application. The device 412 may determine whether or not a voice-forward version of the requested application is available (e.g., installed on, etc.) to open at the device 412, so that opening the application does not cause a device operation mode change. For example, the device 412 may determine that the version of the application available at the device 412 is a touch-forward application version. The device 412 may determine that a voice-forward version of the application is available for enablement. For example, the device 412 may query an application store or data repository to determine that a voice-forward version of the requested application is available for enablement, such as by downloading to the device 412 or activating a remote application at a remote server. The device 412 may audibly present a query requesting permission to enable the voice-forward version. For example, at the first instance 410, the device may audibly present “there is a voice-forward version of the photo app available; should I enable it?” The user may say “yes,” and the device 412 may determine that the user provided an affirmative response. Requesting permission to enable an application or a version of an application (or download other software) may include causing presentation of an audible query requesting permission to enable the version of the application.

As a result, at a second instance 420, the device 412 may enable the voice-forward version of the application. An indication of installation or activation progress may be presented at the device 412.

At a third instance 430, the device 412 may cause the voice-forward version of the application to be opened. The user may interact with the voice-forward version of the application using voice inputs.

FIG. 5 depicts an example process flow 500 for voice-forward enablement of skills or different versions of an application in accordance with one or more example embodiments of the disclosure. While example embodiments of the disclosure may be described in the context of touch-forward and voice-forward operation modes, it should be appreciated that the disclosure is more broadly applicable to any operation mode functionality. Some or all of the blocks of the process flows in this disclosure may be performed in a distributed manner across any number of devices. The operations of the process flow 500 may be optional and may be performed in a different order.

At block 510 of the process flow 500, a verbal request to open an application may be received. For example, a microphone at a device, such as a tablet device or a speaker device, may be used to capture an audio signal in an ambient environment. The audio signal may be converted to a digital signal and/or audio/voice data. The audio signal may be determined to be a voice command, for example, by the presence of a wake word, such as “Alexa.” A meaning of the voice command may be determined using voice processing, which may include speech-to-text processing, natural language processing, and/or other forms of voice processing. The meaning of the voice command “Alexa, open Amazon music” may be determined to be a verbal request to open an application. In other instances, a verbal request to access content or a particular service, such as a streaming service, may be received.
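
For illustration only, the intent determination can be approximated at the text level once speech-to-text has produced a transcript; the toy pattern below stands in for the speech and natural language processing described above and is an assumption, not the actual processing pipeline.

    import re

    # Toy pattern standing in for real speech/NLU processing (illustrative only).
    OPEN_APP_PATTERN = re.compile(
        r"^(alexa[,\s]+)?(can you\s+)?open\s+(my\s+)?(?P<app>.+?)\??$",
        re.IGNORECASE,
    )

    def parse_open_request(transcribed_utterance: str) -> str | None:
        """Return the requested application name, or None if the transcript
        is not an 'open <application>' style request."""
        match = OPEN_APP_PATTERN.match(transcribed_utterance.strip())
        return match.group("app").strip() if match else None

    if __name__ == "__main__":
        print(parse_open_request("Alexa, open Amazon music"))           # Amazon music
        print(parse_open_request("Alexa, can you open my photo app?"))  # photo app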

At optional determination block 520, a determination may be made as to whether the device is in a voice-forward operating mode. For example, computer-executable instructions stored on a memory of a device may be executed to determine an operation mode of the device. Operation modes may include, for example, touch-forward operation modes, voice-forward operation modes, hybrid operation modes, and/or other operation modes. The operation mode may be optionally determined to be a voice-forward operation mode. In some embodiments, the operation mode may be determined by identifying a type of charging device connected to the device. For example, if the device is connected to a docking station, the device may be determined to be in a voice-forward operation mode. If it is determined at determination block 520 that the device is not in a voice-forward operation mode, the process flow 500 may proceed to block 530, at which a touch-forward version of the application may be opened. For example, if the device is not in a voice-forward operation mode, or the device is in a touch-forward operation mode, the device may open a touch-forward version of the application that was requested by a user in the verbal request. Touch-forward versions of applications may be versions of applications, or independent applications, that have a touch-based or touch-forward user interface that encourages users to interact with the application using touch inputs as a primary method of interaction. In some instances, applications may have different versions with different user interfaces geared towards touch-forward or voice-forward interactions, while in other instances, separate applications (or standalone applications) may be used to provide different user interfaces of the same applications. If the device is not in a voice-forward operation mode, as determined at determination block 520, that may indicate that the user is physically interacting with the device, and that the user therefore desires that a touch-forward version of the application be opened. Accordingly, the touch-forward version of the application may be opened if the device is operating in a non-voice-forward operation mode. When opening the touch-forward version of the application, the device may change operation modes to a touch-forward operation mode.

If it is determined at optional determination block 520 that the device is operating in a voice-forward operation mode, the process flow 500 may proceed to determination block 540, at which a determination may be made as to whether a voice-forward version of the application is available at the device. For example, an available application at the device may be configured to operate in different operation modes, such as touch-forward or voice-forward. In some instances, two separate versions of the application may be available at the device, each configured to operate in a different operation mode. A determination may be made as to whether a voice-forward version of the application (e.g., whether the application itself can be configured to operate in voice-forward mode or there is a separate voice-forward version of the application, etc.) is available at the device. Availability at the device may indicate that the program or application is available for execution at the device. If it is determined at determination block 540 that there is a voice-forward version of the application available at the device, the process flow may proceed to block 550, at which the voice-forward version of the application is opened, or the relevant application setting that controls the operation mode of the application is set to a voice-forward operation mode. The user may then interact with the application using voice input and/or touch input. When opening the voice-forward version of the application, the device may remain in, or change operation modes to, the voice-forward operation mode.

If it is determined at determination block 540 that there is no voice-forward version of the application available at the device, or that there is no operation mode setting of the application that can be changed to cause voice-forward operation, the process flow may proceed to determination block 560, at which a determination may be made as to whether a voice-forward version of the application is available for enablement. Enablement may include downloading data onto a client device, activating a remote application in connection with a user account associated with the client device (e.g., enabling an Alexa skill at one or more remote servers, etc.), activating a local application, and the like. For example, a determination may be made as to whether a voice-forward version of the application is available for enablement, such as whether the application is available for enabling at a remote server, and/or downloading from an application store, a data repository, another device, or another datastore. In some embodiments, the voice-forward version may be configured to be enabled as a skill, which may interface with a separate application through one or more application programming interface(s). Access to the skill may require user permission to be enabled in some instances. Accordingly, in some embodiments, a voice-forward version of an application may not have to be determined to be available for enablement, but access to a voice-forward skill may be determined to be available, or both.

If it is determined at determination block 560 that there is no voice-forward version of the application available for download and/or no skill or remote application available for enablement, the process flow may proceed to block 530, at which the touch-forward version of the application may be opened. When opening the touch-forward version of the application, the device may change operation modes to a touch-forward operation mode.

If it is determined at determination block 560 that there is a voice-forward version of the application available for enablement or download, the process flow may proceed to block 570, at which an audible query requesting permission to download or enable the voice-forward version may be presented. For example, one or more speakers of the device may be used to present an audible query of “there is a voice-forward version of Amazon music available, should I enable it?” or “would you like to enable the Amazon music skill?” In some embodiments, a visual query may be presented on a display of the device in addition to or instead of the audible query. The process flow may proceed to determination block 580.

At determination block 580, a determination may be made as to whether an affirmative response was received. For example, after presenting the audible query or visual selection, the device may monitor for a verbal affirmative response such as “yes” or “go ahead,” or a selection of a “yes” or other affirmative input may be received at a display of the device. If it is determined at determination block 580 that an affirmative response was not received, such as a “no” response, or that no response was received within a time interval, the process flow may proceed to block 530, at which the touch-forward version of the application may be opened.

If it is determined at determination block 580 that an affirmative response was received, the process flow may proceed to block 590, at which the voice-forward version of the application may be downloaded or installed, or the skill may be enabled. After enabling, the process flow may proceed to block 550, at which the voice-forward version of the application may be opened. In some embodiments, access to an application or service, such as a music service or video subscription service, may be enabled instead of a voice-based version of an application. For example, a request may be made of an aggregator service, which may select a specific service provider from a number of service providers. In some embodiments, such aggregators may not be applications, but may handle requests for services or content.
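
The decision logic of process flow 500 (blocks 510 through 590) can be condensed into a single non-limiting sketch. The device facade and every method on it are hypothetical names used only to mirror the blocks described above.

    def open_application(device, app_name: str) -> str:
        """Condensed sketch of blocks 510-590; `device` is a hypothetical facade
        and every method on it is an illustrative name. Returns the branch taken."""
        if not device.in_voice_forward_mode():                              # block 520
            device.open(app_name, version="touch-forward")                  # block 530
            return "touch-forward"
        if device.has_local_version(app_name, "voice-forward"):             # block 540
            device.open(app_name, version="voice-forward")                  # block 550
            return "voice-forward"
        if not device.voice_forward_version_available_remotely(app_name):   # block 560
            device.open(app_name, version="touch-forward")                  # block 530
            return "touch-forward"
        device.ask_aloud(f"There is a voice-forward version of {app_name} "
                         f"available; should I enable it?")                 # block 570
        if device.await_affirmative_response():                             # block 580
            device.enable_voice_forward_version(app_name)                   # block 590
            device.open(app_name, version="voice-forward")                  # block 550
            return "voice-forward"
        device.open(app_name, version="touch-forward")                      # block 530
        return "touch-forward"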

As a result of the process flow, a user of the device may not have to change a device operation mode to interact with an application. For example, if the user is interacting with the device in a voice-forward operation mode, and requests an application that is available in a touch-forward operation mode, the device may automatically implement some or all of process flow 500 to facilitate continued use of the device and the requested application in the voice-forward mode, without having the user physically interact with or touch the device, in some embodiments.

FIG. 6 depicts example use cases 600 of various device operation modes and corresponding versions of applications in accordance with one or more example embodiments of the disclosure. In FIG. 6, users may request that a certain application be opened at a device.

If the requested application is in a format or configured to operate in an operation mode that is different than a current operation mode of the device, the device may determine whether another version of the application is available for enablement or to use, so that the device can continue in the current operation mode. If there is another version of the application, the device may request permission to enable and launch the application version. However, in some instances permission to enable may not be granted. As a result, the existing version of the application may be opened or launched at the device, and the device operation mode may be changed accordingly. For example, in FIG. 6 at a first instance 610, a user may request that a social media application be launched. The device 612 may be in a voice-forward operation mode. However, a voice-forward version of the application may not be available at the device 612. If a voice-forward version is available for enablement, the device may request permission to enable it. If permission is not granted, or there is no voice-forward version of the application available, the device 612 may proceed with opening the touch-forward version of the application. As shown at the first instance 610, although the device 612 is in a landscape orientation, the application may be launched in a touch-forward operation mode, and may be presented in a portrait orientation, regardless of the positioning of the device in landscape mode. The user may interact with the application in the touch-forward operation mode.

If a voice-forward version of the application is determined to be available, and permission was granted to enable it, the application version may be enabled and launched, as shown at a second instance 620. As a result, the user may continue interacting with the device and the launched application in the existing voice-forward operation mode.

FIG. 7 depicts an example process flow 700 for voice-forward changes to device operation modes in accordance with one or more example embodiments of the disclosure. While example embodiments of the disclosure may be described in the context of touch-forward and voice-forward operation modes, it should be appreciated that the disclosure is more broadly applicable to any operation mode functionality. Some or all of the blocks of the process flows in this disclosure may be performed in a distributed manner across any number of devices. The operations of the process flow 700 may be optional and may be performed in a different order.

At block 710 of the process flow 700, computer-executable instructions stored on a memory of a device, such as a remote server or a user device, may be executed to determine that a device is connected to an accessory device. For example, a tablet or other electronic device may have a connector configured to engage a charging device, such as a wall charger, external battery, docking station, etc., or the device may have a connector port configured to receive a connector of a charging device. The connector and/or connector port may be removably connected to the device. For example, the connector or connector port may be an accessory coupled to the device. In other instances, accessory devices may include a stand, a charging cradle, a lamp, or another accessory device.

At block 720 of the process flow 700, computer-executable instructions stored on a memory of a device may be executed to determine that the device is to change a device operating mode from a first operating mode to a second operating mode. For example, a connection to a certain type of charging device, or a specific charging device (e.g., as determined by a charging device identifier, etc.), may usually cause automatic changes to a device operating mode of the device, with certain exceptions in some embodiments. For example, a charging device may be associated with an operation mode of a voice-forward operation mode. Connecting the device to that charging device may cause the operation mode of the device to be automatically changed or switched to the voice-forward operation mode. In an example, a docking station may be associated with a second operation mode of a voice-forward operation mode, where the user interface presented at the display is voice-forward or encourages users to interact with the device via voice input (e.g., a relatively lower number of selectable options, presenting voice hints, etc.). A device may be operating in a first operating mode of a touch-forward operation mode prior to being connected to the docking station or other charging device. When the device is connected to the docking station, the device may change the device operation mode from the touch-forward operation mode to the voice-forward operation mode. This may be because a user may interact with the device from increased distances while the device is charging and/or docked at the docking station. In some embodiments, the operation mode of the device may always be changed based at least in part on the type of connected device and/or charging device. In other embodiments, the operation mode of the device may be changed unless there is an exception or other rule. In such instances, changes to operation modes may be deferred or canceled.

At block 730 of the process flow 700, computer-executable instructions stored on a memory of a device may be executed to cause the device to change the device operating mode to the second operating mode. The second operation mode may be a voice-forward operation mode configured to encourage voice inputs by users, or another type of operation mode. For example, after the device is connected to a certain charging device, such as the docking station, the automatic change to the device operation mode as a result of connecting to the charging device may be implemented. In an example, the device operation mode may be changed from a first operation mode of a voice-forward operation mode to a second operation mode of a touch-forward operation mode, or vice versa.

At block 740, computer-executable instructions stored on a memory of a device may be executed to receive first audio data, which may be first voice data, indicative of a request to change the device operating mode to the first operating mode. For example, a microphone of the device may capture sound in an ambient environment and may generate an audio signal representative of the sound. The audio signal may be converted to voice data and may be processed using voice processing techniques to determine a meaning of the voice data. In some embodiments, the voice data and/or audio signal may be sent or streamed to a remote server for voice processing and/or to determine a meaning of the voice data. In some embodiments, detection of a wake word, such as “Alexa,” may be performed locally at the device. In an example, a user may say a voice input of “Alexa, change to a touch-forward operation mode.” Voice data representing the voice input may indicate that the user is requesting to change the device operation mode to the touch-forward operation mode. In some embodiments, the device may perform voice processing locally, while in other embodiments, the device may receive instructions or an indication of the meaning of the voice data from a remote server or other computer system. Other voice commands or voice inputs may include voice commands to close applications, close operation modes, open operation modes, open applications, switch applications, switch operation modes, etc.
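
As a rough sketch of how such a mode-change utterance might be handled, the following Python illustrates local wake word detection followed by a stand-in for remote voice processing that returns the requested operation mode. The function names and the returned intent structure are assumptions for illustration, not an actual speech-processing API.

```python
# Minimal sketch: wake word checked locally, meaning resolved "remotely".
WAKE_WORD = "alexa"

def detect_wake_word(transcript):
    # Local check; a real device would run detection on the audio signal.
    return transcript.lower().startswith(WAKE_WORD)

def send_to_remote(transcript):
    # Stand-in for streaming audio/voice data to a remote speech service.
    text = transcript.lower()
    if "touch" in text and "mode" in text:
        return {"intent": "ChangeOperationMode", "target_mode": "touch_forward"}
    if "voice" in text and "mode" in text:
        return {"intent": "ChangeOperationMode", "target_mode": "voice_forward"}
    return {"intent": "Unknown"}

def handle_utterance(transcript, device_state):
    if not detect_wake_word(transcript):
        return device_state  # ignore audio without the wake word
    result = send_to_remote(transcript)
    if result.get("intent") == "ChangeOperationMode":
        device_state["operating_mode"] = result["target_mode"]
    return device_state

print(handle_utterance("Alexa, change to a touch-forward operation mode",
                       {"operating_mode": "voice_forward"}))
```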

At block 750, computer-executable instructions stored on a memory of a device may be executed to cause the device to change the device operating mode to the first operating mode. The first operation mode may be a touch-forward operation mode configured to encourage touch inputs by users, or another type of operation mode. For example, the device operation mode may be changed (or caused to change) to the operation mode that was requested by the user, which may be the first operation mode in this example. In an example, the device operation mode may be changed from a second operation mode of a touch-forward operation mode to a first operation mode of a voice-forward operation mode, or vice versa.

At optional block 760, computer-executable instructions stored on a memory of a device may be executed to present a home screen user interface associated with the first operating mode. For example, the device may be returned to a touch-forward operation mode based on the verbal request from the user. After changing to the touch-forward operation mode, a home screen user interface for the touch-forward operation mode may be presented at the device. Previously presented application interfaces or user interfaces associated with the second operation mode, or the voice-forward operation mode, may be closed and/or replaced by the home screen user interface for the touch-forward operation mode. For example, a user may have been watching a news briefing video in the second operation mode, and after the change back to the first operation mode, the news briefing video may be replaced by a user interface or home screen associated with the first operation mode, such as a home screen with application access shortcuts. In some embodiments, the news briefing video may instead be reformatted for the first operation mode and presented. In some embodiments, when returning to the first operating mode, the last presented or most recent application may be presented. For example, if a news application was active in the first device operating mode prior to the switch to the second device operating mode, when the device returns to the first operating mode, the news application may be presented again.

FIG. 8 is a schematic illustration of an example use case 800 for voice-forward changes to device operation modes in accordance with one or more example embodiments of the disclosure. At a first instance 810, a device 812 may be docked at a docking station 814. The device 812 may be in a voice-forward operation mode. In some embodiments, the device 812 may not have to be docked in order to be in a voice-forward operation mode.

While the device 812 is in the voice-forward operation mode, a user may say a user utterance or a voice input of “change to touch mode.” The device 812 may determine that the voice input is a request to change the operation mode of the device. To determine what content to present with the changed operation mode, in some embodiments, the device 812 may maintain a presented application interface, but may present a reformatted version of the application interface in accordance with the change in operation mode. For example, the device may determine an active application executing on the device, and may reformat an application interface presented at the device in the touch-forward operating mode for presentation in the voice-forward operating mode, or may reformat the application interface from the voice-forward operation mode to the touch-forward operation mode. The reformatted application interface may be presented at the device.
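
A minimal sketch of per-mode interface reformatting is shown below, assuming hypothetical layout parameters (font scale, number of on-screen controls, voice hints); the disclosure does not prescribe these specific parameters.

```python
# Hypothetical layout configurations for each operation mode.
LAYOUTS = {
    "touch_forward": {"font_scale": 1.0, "max_buttons": 12, "voice_hints": False},
    "voice_forward": {"font_scale": 1.6, "max_buttons": 4, "voice_hints": True},
}

def reformat_interface(active_app, new_mode):
    """Return a reformatted interface description for the active application."""
    layout = LAYOUTS[new_mode]
    return {"app": active_app, "mode": new_mode, **layout}

# Example: "change to touch mode" while a shopping application is active.
print(reformat_interface("shopping", "touch_forward"))
```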

In some embodiments, when a home screen user interface is presented at the time the voice input to change operation modes is received, the device may change operation modes and present another home screen user interface that is associated with the updated operation mode. For example, the user interface presented at the first instance 810 may be a home screen user interface for a voice-forward operation mode of the device. At a second instance 820, a home screen user interface for a touch-forward operation mode may be presented when the device changes to the touch-forward operation mode responsive to the voice input from the user.

FIG. 9 is a schematic illustration of example user interfaces 900 of an application in different device operation modes in accordance with one or more example embodiments of the disclosure. At a first instance 910, a device 912 may be used in a touch-forward operation mode or a tablet operation mode while a user is holding the device 912. An application interface, such as a shopping application interface, may be presented in a first configuration at the device 912 while the device is in the touch-forward operation mode.

At a second instance 920, the device 912 may be docked at a docking station 914. When the device 912 is docked, the device 912 may automatically change a device operation mode from the touch-forward operation mode to a voice-forward operation mode. As a result, the same application may be presented, but the application interface and/or a user interface that is presented may be configured for voice input as a primary manner of interaction with the device 912. The user may cause the device 912 to change operation modes using voice input (e.g., “switch to tablet mode,” etc.), by verbally requesting that a touch-forward application be opened or that the voice-forward operation mode be closed or canceled (e.g., “close voice-forward mode,” etc.), by physically interacting with the device, and so forth.

FIG. 10 depicts an example process flow 1000 for selective requests for passwords for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure. While example embodiments of the disclosure may be described in the context of touch-forward and voice-forward operation modes, it should be appreciated that the disclosure is more broadly applicable to any operation mode functionality. Some or all of the blocks of the process flows in this disclosure may be performed in a distributed manner across any number of devices. The operations of the process flow 1000 may be optional and may be performed in a different order.

At block 1010 of the process flow 1000, computer-executable instructions stored on a memory of a device, such as a remote server or a user device, may be executed to determine first voice data including a first voice request from a user to access information associated with a user account. For example, a user in an ambient environment of a device may say “what's on my calendar for this afternoon?” or “Alexa, what's on my calendar this afternoon?” The voice input or voice request may be captured by one or more microphones of the device and converted to voice data. The voice data may be processed to determine a meaning of the voice input. The voice request may be determined to be a request to access information, such as calendar event information, associated with a user account, or with a user account that is associated with the device and/or a calendar application on the device. In some embodiments, the voice data may be processed locally, while in other embodiments, the voice data may be sent to a remote server or other computer system for processing. In some embodiments, voice requests may be for certain information from an application, such as calendar event information from a calendar application, contact information from a contacts or directory application, bank account balance information from a banking application, and the like, whereas in other embodiments, voice requests may be to open certain applications. For example, a user may say “Alexa, open my calendar,” and so forth. Such requests may be treated or processed differently than requests for certain information that may be determined from applications.
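
One possible way to distinguish the two request types is sketched below in Python; the keyword lists and the returned classification fields are illustrative assumptions only.

```python
# Hypothetical phrase lists used to separate "open an application" requests
# from requests for information sourced from an application.
OPEN_PHRASES = ("open", "launch", "start")
INFO_SOURCES = {"calendar": "calendar_app", "phone number": "contacts_app",
                "balance": "banking_app"}

def classify_request(transcript):
    text = transcript.lower()
    if any(text.startswith(f"{p} ") or f" {p} " in text for p in OPEN_PHRASES):
        return {"type": "open_application"}
    for keyword, source_app in INFO_SOURCES.items():
        if keyword in text:
            return {"type": "information_request", "source": source_app}
    return {"type": "other"}

print(classify_request("what's on my calendar for this afternoon?"))
print(classify_request("open my calendar"))
```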

At block 1020 of the process flow 1000, computer-executable instructions stored on a memory of a device may be executed to determine that a device at which the first voice data is received is in a locked state. For example, the device or a remote server may determine that the device is protected by a password, or that access to the device is restricted without some form of authentication of a user. Access to the device and/or information or applications stored at the device may be restricted to authorized users. In some embodiments, access or permission may be granted based on touch or voice input of a passcode (e.g., alphanumeric characters, etc.), a gesture, a biometric marker or identifier (e.g., fingerprint, face scan, voice match, etc.), or another form of password. To determine whether a device is protected by a password, the device and/or remote server may determine whether a password setting is active at the device. Such determinations may be made at the time the voice request is made, or within a time interval of the voice request. In some embodiments, devices may be transitioned from a locked state to an unlocked state using voice identification, authentication (e.g., a voice command in addition to facial recognition or camera input, etc.), or other means.

At block 1030 of the process flow 1000, computer-executable instructions stored on a memory of a device may be executed to determine that the user is authorized to access the information using at least a portion of the first voice data. For example, the device may stream and/or send a portion of the voice data and/or the audio signal to a remote server to determine whether the user that requested the information is authorized to access the information that was requested. Authorization may be determined based at least in part on a match between attributes of the voice of the speaker or user and a set of stored attributes representing a voice of an authorized user, in order to determine whether the user is the same as an authorized user.

To determine authorization, the remote server, or the device locally, may compare the voice data of the voice request, or attributes extracted from the voice data, to patterns of voices of users that are authorized to access the device. Results of the comparison may be used to generate a confidence score that represents a likelihood or probability that the user making the voice request is the same as an authorized user. The confidence score may be representative of a match between the requesting user's voice and the voice of an authorized user in some embodiments. Based at least in part on the voice data and/or attributes of the user's voice as determined from the voice data or audio signal, a determination may be made that the user making the voice request is authorized to access the information that was requested.
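
The following sketch illustrates one simple way a confidence score could be derived by comparing extracted voice attributes to a stored profile. Production speaker identification typically relies on learned voice embeddings, so the specific attributes and arithmetic here are assumptions for illustration only.

```python
import math

def confidence_score(request_attrs, stored_attrs):
    """Higher score = closer match between the speaker and the stored profile."""
    keys = stored_attrs.keys() & request_attrs.keys()
    if not keys:
        return 0.0
    # Normalized distance over the shared attributes, mapped to a 0-100 scale.
    distance = math.sqrt(sum(
        ((request_attrs[k] - stored_attrs[k]) / max(abs(stored_attrs[k]), 1e-6)) ** 2
        for k in keys) / len(keys))
    return round(100.0 * math.exp(-distance), 1)

# Hypothetical stored profile for an authorized user and attributes
# extracted from the request audio.
stored = {"pitch_hz": 180.0, "cadence_wpm": 150.0, "volume_db": 62.0}
request = {"pitch_hz": 176.0, "cadence_wpm": 155.0, "volume_db": 60.0}
print(confidence_score(request, stored))  # prints a score near 100 for a close match
```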

At block 1040, computer-executable instructions stored on a memory of a device may be executed to cause presentation of the information at the device without requesting authentication. For example, the remote server may cause the device to, or the device may, present the information that was requested without requesting the password. For example, the device may audibly or visually present the requested calendar information to the user while bypassing the password restriction on the device. In an example, the device may audibly present “you have a 3:00 meeting with LeBron James in Atlanta” responsive to the user's voice request. In another example, the device may present visual event information indicating the 3:00 meeting on the user's calendar. If the device was in a touch-forward operation mode, the device may have required input of the password, whereas in the voice-forward operation mode, the device may bypass the password requirement and present the requested information without requiring the password.

FIG. 11 depicts an example process flow 1100 for selective requests for authentication for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure. While example embodiments of the disclosure may be described in the context of touch-forward and voice-forward operation modes, it should be appreciated that the disclosure is more broadly applicable to any operation mode functionality. Some or all of the blocks of the process flows in this disclosure may be performed in a distributed manner across any number of devices. The operations of the process flow 1100 may be optional and may be performed in a different order.

At block 1110, a first verbal request to access information associated with a user account may be received at a device. For example, a microphone at a device, such as a tablet device or a speaker device, may be used to capture an audio signal in an ambient environment. The audio signal may be converted to a digital signal and/or voice data. The audio signal may be determined to be a voice command, for example, by the presence of a wake word, such as “Alexa.” A meaning of the voice command may be determined using voice processing, which may include speech-to-text processing, natural language processing, and/or other forms of voice processing. The meaning of the voice command “Alexa, what is Adam's phone number?” may be determined to be a verbal request to access information associated with a user account. Information associated with a user account may include contact information, calendar information, bank account information, order or purchase history information, and/or other information that may be specific to a user or to a device. Information associated with a user account may include information that is associated with user accounts of various applications stored at the device, such as service provider applications (e.g., rideshare applications, on-demand applications, etc.). The process flow may proceed to determination block 1120.

At determination block 1120, a determination may be made as to whether the device is in a locked state. For example, for a device that is password protected, a locked state may indicate that the password has not been entered. Once entered, the device may enter an unlocked state. The determination at block 1120 may be made by the device or by a remote server. If the device is in a locked state, a request for authentication may be presented, so as to unlock the device. For example, authentication may include passwords, passcodes, biometric signatures, gestures, and/or other authentication mechanisms. A locked state may prevent access to the device until authentication is verified. Passwords may be alphanumeric passwords, graphic passwords, audible passwords, and the like. Other forms of authentication may include biometric passwords, gesture passwords, personal identification numbers, and/or other forms of authentication. If it is determined that the device is not in a locked state at determination block 1120, the process flow may proceed to block 1130, at which presentation of the information may be caused, or the information may be presented at the device. For example, the device may present on a display, or may audibly present, the information requested by the user, such as by audibly presenting “Adam's phone number is 888-280-4331.” In some embodiments, the requested information may be presented regardless of an operation mode of the device if the device is not in a locked state.

If it is determined at determination block 1120 that the device is in a locked state, the process flow may proceed to determination block 1140, at which a determination may be made as to whether the device is in a docked operation mode or in a certain location. A docked operation mode may be, for example, a voice-forward operation mode or another operation mode associated with a docked device for which the user may not be in physical proximity to, or may not easily be able to physically touch, the device. If it is determined that the device is not in a docked operation mode, the process flow may proceed to block 1150, at which authentication may be requested. For example, if the device is in a touch-forward operation mode, that may indicate that the user is physically near the device or is able to touch the device, and may therefore be able to easily enter a password via touch input or provide another authentication input. Accordingly, authentication may be requested. In some embodiments, passwords may be input or authentication may occur via voice input. In some embodiments, device location may be determined using a WiFi network identifier for a network to which the device is connected. If the device is connected to certain WiFi networks, the determination at block 1140 may be positive.
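
A compact sketch of the determination at block 1140, combining the docked operating mode check with a trusted WiFi network check, is shown below; the device state fields and the trusted network list are assumptions for illustration.

```python
# Hypothetical list of WiFi networks treated as trusted locations.
TRUSTED_SSIDS = {"home-network", "kitchen-ap"}

def in_docked_mode_or_trusted_location(device_state):
    # Positive if the device is docked in a voice-forward mode...
    if device_state.get("operating_mode") == "voice_forward" and device_state.get("docked"):
        return True
    # ...or connected to a WiFi network associated with a known location.
    return device_state.get("wifi_ssid") in TRUSTED_SSIDS

print(in_docked_mode_or_trusted_location(
    {"operating_mode": "voice_forward", "docked": True, "wifi_ssid": "cafe"}))
```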

If it is determined that the device is in the docked operation mode at determination block 1140, the process flow may proceed to optional determination block 1160. At optional determination block 1160, a determination may be made as to whether the requested information is sensitive. For example, certain information may be determined to be sensitive based at least in part on a sensitivity classification of the information or of an application that the information is associated with or sourced from. For example, a calendar application with the user's personal calendar information may be determined to be sensitive because it is user-specific information and/or because the calendar application is classified as a sensitive application. Sensitivity may be determined based at least in part on a sensitivity classification of applications. If it is determined that the requested information is not sensitive, the process flow may proceed to block 1130, at which the information may be caused to be presented, or may be presented at the device. Accordingly, although the device may be password protected, the information may be presented responsive to the voice command or verbal request, so as to avoid requiring the user to input the password or provide authentication, since the information is not sensitive. An example of information that is not sensitive may include information related to research questions (e.g., what time do the Falcons play today?, when will the store open?, etc.) and/or requests that are not specific to a user account or a device.

In some embodiments, the device may receive an indication (e.g., from a remote server or other computer system, etc.) that the user is authorized to access applications on the device. The indication may include a confidence score that the user is authorized to access applications on the device. In some instances, the device or the remote server may determine a sensitivity classification of an application, such as the calendar application. The sensitivity classification may be indicative of a level of sensitivity of information associated with the application. A confidence score threshold for accessing the application and/or information associated with the application may be determined using the sensitivity classification.
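
A sketch of deriving a confidence score threshold from a sensitivity classification is shown below; the classification labels and threshold values are assumptions chosen only to illustrate the idea.

```python
# Hypothetical mapping from sensitivity classification to the confidence
# score threshold required before presenting information without a password.
SENSITIVITY_THRESHOLDS = {
    "none": 0,      # e.g., general research questions
    "low": 60,      # e.g., shopping lists
    "medium": 80,   # e.g., calendar events
    "high": 95,     # e.g., banking information
}

def threshold_for(app_sensitivity):
    # Unknown classifications fall back to the strictest threshold.
    return SENSITIVITY_THRESHOLDS.get(app_sensitivity, 95)

print(threshold_for("medium"))  # 80, matching the example threshold discussed below
```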

If it is determined at optional determination block 1160 that the requested information is sensitive, the process flow may proceed to block 1170, at which a confidence score indicative of a likelihood that a user requesting the information is an authorized user may be determined using voice data. For example, if the information is determined to be associated with a user account, it may be determined to be sensitive. At block 1170, the voice data associated with the verbal request may be processed to determine a confidence score that indicates a likelihood that the user is authorized to receive the requested information. In some embodiments, the confidence score may be determined at a device, while in other instances, the confidence score may be determined at a remote server using the voice data, and an indication of the confidence score or a command to present or not present the information may be sent to the device. The voice data may be used to identify the speaker or user that said the verbal request or voice command, and the confidence score may be an indication of a likelihood that the user is actually the speaker identified.

Confidence scores may be determined by extracting or determining one or more attributes of a user's voice from the voice data, and comparing the results to a predetermined set of attributes of authorized users' voices. Attributes may include pitch, patterns, cadence, accents, volume, and/or other attributes. The process flow may proceed to determination block 1180.

At determination block 1180, a determination may be made as to whether the confidence score satisfies a threshold, such as a confidence score threshold. For example, after the confidence score is determined, the confidence score may be compared to a confidence score threshold to determine whether the confidence score is equal to or greater than the threshold. For example, the confidence score threshold may be 80, and a confidence score equal to or greater than 80 may satisfy the threshold. In some embodiments, the confidence score threshold may be dynamic and may change based at least in part on a sensitivity classification of the requested information. For example, for more sensitive information, the threshold may be relatively higher than for less sensitive information.

If it is determined at determination block 1180 that the confidence score does not satisfy the threshold, the process flow may proceed to block 1150, at which authentication is requested. If it is determined that the confidence score satisfies the threshold, the process flow may proceed to block 1130, at which the presentation of the information is caused, or the information is presented at the device. The information may be presented while the device is in the docked operation mode, in some embodiments. If the device is no longer in a docked operation mode, the password may be requested.
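
Putting the determinations of blocks 1120 through 1180 together, the decision could be sketched as follows. The field names and threshold values repeat the hypothetical ones used above and are not part of any particular implementation.

```python
def decide(device_state, request):
    """Return 'present' or 'request_authentication' for a voice request."""
    if not device_state.get("locked", True):
        return "present"                      # block 1120 -> block 1130
    if not device_state.get("docked"):
        return "request_authentication"       # block 1140 -> block 1150
    if request.get("sensitivity", "none") == "none":
        return "present"                      # block 1160 -> block 1130
    # Dynamic threshold keyed by the (hypothetical) sensitivity classification.
    threshold = {"low": 60, "medium": 80, "high": 95}[request["sensitivity"]]
    if request.get("confidence", 0) >= threshold:
        return "present"                      # block 1180 -> block 1130
    return "request_authentication"           # block 1180 -> block 1150

print(decide({"locked": True, "docked": True},
             {"sensitivity": "medium", "confidence": 91}))  # -> "present"
```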

At optional block 1190, a second verbal request to open an application while in the docked operating mode may be received. If such a request is received, for example “open my calendar,” the process flow may proceed to block 1150, at which authentication is requested.

As a result, information that is requested from password-protected devices or devices in locked states may be presented without authentication or requiring input of a password, depending on authentication or identification of a user using their voice, and optionally on a sensitivity of the requested information. A verbal request to open an application on a password-protected or locked device, however, may be blocked in some embodiments. In other embodiments, applications may be opened using voice commands based at least in part on a confidence score that the user is an authorized user for accessing the device.

FIGS. 12-13 are schematic illustrations of example use cases 1200 for selective requests for passwords for voice-forward requests for information or applications in accordance with one or more example embodiments of the disclosure. In FIG. 12, at a first instance 1220, a device 1212 may be docked at a docking station 1214 and may be in a voice-forward operation mode. The device may be used by a user to locate restaurant reviews using voice input. The user may be interacting with the device using voice commands. For example, the user may say “show me new restaurants.” In response, the device may present information related to new restaurants, along with restaurant reviews. The user may select a restaurant for more information using a voice input. For example, the user may say “tell me more about Katana.” Responsive to the request, the device may present specific content for the Katana restaurant listing. The user may request that a restaurant menu be presented.

At a second instance 1240, the user may provide a voice input 1230 requesting that a reservation be made. This may be determined to be a sensitive request because it relates to user account-specific information. As a result, the device may attempt to identify the speaker using the voice input data. If the speaker cannot be identified, the device may prompt the user for a password prior to proceeding with the reservation. If the user can be identified as an authorized user, the device may bypass the password and proceed with making the reservation for the user without requesting a password. The device 1212 may determine that the user is an authorized user, for example based on analysis of the user's voice, and may proceed with making the reservation for the user, as illustrated in FIG. 12.

In another use case 1300 at FIG. 13, at a third instance 1310, the user may say “can you make me a reservation for tonight?” If the speaker or user cannot be identified or otherwise determined to be an authorized user, the device may prompt the user for a password prior to proceeding with the reservation.

At a fourth instance 1340, the device may audibly request a password 1330 and present a password input interface for the user to input a device password before proceeding with making the reservation. If the password is input or other authentication is confirmed, the device may transition to an unlocked state for a certain period of time before returning to a locked state.

To identify the user, the device may send a request for speaker identification to a remote server, where the response to the speaker identification request represents a likelihood that the user that spoke the voice input is authorized to access applications on the device or access information using the device. In some embodiments, attributes of at least a portion of voice data may be compared to attributes of a stored voice data sample. The confidence score may be indicative of a likelihood that the user is authorized to access the information.

In some embodiments, password bypass functionality may only be available when the device is docked or in a certain operation mode, while in other embodiments, password bypass functionality may always be available.

In some instances, a determination may be made, for example using a microphone or camera, that the user is physically present within proximity of the device prior to bypassing a password. Access to applications on a device, as opposed to services or information, may be prevented without a password in some embodiments.

One or more operations of the methods, process flows, or use cases of FIGS. 1-13 may have been described above as being performed by a user device, or more specifically, by one or more program module(s), applications, or the like executing on a device. It should be appreciated, however, that any of the operations of the methods, process flows, or use cases of FIGS. 1-13 may be performed, at least in part, in a distributed manner by one or more other devices, or more specifically, by one or more program module(s), applications, or the like executing on such devices. In addition, it should be appreciated that the processing performed in response to the execution of computer-executable instructions provided as part of an application, program module, or the like may be interchangeably described herein as being performed by the application or the program module itself or by a device on which the application, program module, or the like is executing. While the operations of the methods, process flows, or use cases of FIGS. 1-13 may be described in the context of the illustrative devices, it should be appreciated that such operations may be implemented in connection with numerous other device configurations.

The operations described and depicted in the illustrative methods, process flows, and use cases of FIGS. 1-13 may be carried out or performed in any suitable order as desired in various example embodiments of the disclosure. Additionally, in certain example embodiments, at least a portion of the operations may be carried out in parallel. Furthermore, in certain example embodiments, fewer, more, or different operations than those depicted in FIGS. 1-13 may be performed.

Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure.

Certain aspects of the disclosure are described above with reference to block and flow diagrams of systems, methods, apparatuses, and/or computer program products according to example embodiments. It will be understood that one or more blocks of the block diagrams and flow diagrams, and combinations of blocks in the block diagrams and the flow diagrams, respectively, may be implemented by execution of computer-executable program instructions. Likewise, some blocks of the block diagrams and flow diagrams may not necessarily need to be performed in the order presented, or may not necessarily need to be performed at all, according to some embodiments. Further, additional components and/or operations beyond those depicted in blocks of the block and/or flow diagrams may be present in certain embodiments.

Accordingly, blocks of the block diagrams and flow diagrams support combinations of means for performing the specified functions, combinations of elements or steps for performing the specified functions, and program instruction means for performing the specified functions. It will also be understood that each block of the block diagrams and flow diagrams, and combinations of blocks in the block diagrams and flow diagrams, may be implemented by special-purpose, hardware-based computer systems that perform the specified functions, elements, or steps, or combinations of special-purpose hardware and computer instructions.

Illustrative Device Architecture

FIG. 14 is a schematic block diagram of an illustrative device 1400 in accordance with one or more example embodiments of the disclosure. The device 1400 may include any suitable computing device capable of receiving and/or generating data including, but not limited to, a mobile device such as a smartphone, tablet, e-reader, wearable device, or the like; a desktop computer; a laptop computer; a content streaming device; a set-top box; or the like. The device 1400 may correspond to an illustrative device configuration for the devices of FIGS. 1-13.

The device 1400 may be configured to communicate via one or more networks with one or more servers, search engines, user devices, or the like. In some embodiments, a single device or single group of devices may be configured to perform more than one type of device operation mode management functionality.

Example network(s) may include, but are not limited to, any one or more different types of communications networks such as, for example, cable networks, public networks (e.g., the Internet), private networks (e.g., frame-relay networks), wireless networks, cellular networks, telephone networks (e.g., a public switched telephone network), or any other suitable private or public packet-switched or circuit-switched networks. Further, such network(s) may have any suitable communication range associated therewith and may include, for example, global networks (e.g., the Internet), metropolitan area networks (MANs), wide area networks (WANs), local area networks (LANs), or personal area networks (PANs). In addition, such network(s) may include communication links and associated networking devices (e.g., link-layer switches, routers, etc.) for transmitting network traffic over any suitable type of medium including, but not limited to, coaxial cable, twisted-pair wire (e.g., twisted-pair copper wire), optical fiber, a hybrid fiber-coaxial (HFC) medium, a microwave medium, a radio frequency communication medium, a satellite communication medium, or any combination thereof.

In an illustrative configuration, the device 1400 may include one or more processors (processor(s)) 1402, one or more memory devices 1404 (generically referred to herein as memory 1404), one or more input/output (I/O) interface(s) 1406, one or more network interface(s) 1408, one or more sensors or sensor interface(s) 1410, one or more transceivers 1412, one or more optional speakers 1414, one or more optional microphones 1416, and data storage 1420. The device 1400 may further include one or more buses 1418 that functionally couple various components of the device 1400. The device 1400 may further include one or more antenna(e) 1434 that may include, without limitation, a cellular antenna for transmitting or receiving signals to/from a cellular network infrastructure, an antenna for transmitting or receiving Wi-Fi signals to/from an access point (AP), a Global Navigation Satellite System (GNSS) antenna for receiving GNSS signals from a GNSS satellite, a Bluetooth antenna for transmitting or receiving Bluetooth signals, a Near Field Communication (NFC) antenna for transmitting or receiving NFC signals, and so forth. These various components will be described in more detail hereinafter.

The bus(es) 1418 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various components of the device 1400. The bus(es) 1418 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth. The bus(es) 1418 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.

The memory 1404 of the device 1400 may include volatile memory (memory that maintains its state when supplied with power) such as random access memory (RAM) and/or non-volatile memory (memory that maintains its state even when not supplied with power) such as read-only memory (ROM), flash memory, ferroelectric RAM (FRAM), and so forth. Persistent data storage, as that term is used herein, may include non-volatile memory. In certain example embodiments, volatile memory may enable faster read/write access than non-volatile memory. However, in certain other example embodiments, certain types of non-volatile memory (e.g., FRAM) may enable faster read/write access than certain types of volatile memory.

In various implementations, the memory 1404 may include multiple different types of memory such as various types of static random access memory (SRAM), various types of dynamic random access memory (DRAM), various types of unalterable ROM, and/or writeable variants of ROM such as electrically erasable programmable read-only memory (EEPROM), flash memory, and so forth. The memory 1404 may include main memory as well as various forms of cache memory such as instruction cache(s), data cache(s), translation lookaside buffer(s) (TLBs), and so forth. Further, cache memory such as a data cache may be a multi-level cache organized as a hierarchy of one or more cache levels (L1, L2, etc.).

The data storage 1420 may include removable storage and/or non-removable storage including, but not limited to, magnetic storage, optical disk storage, and/or tape storage. The data storage 1420 may provide non-volatile storage of computer-executable instructions and other data. The memory 1404 and the data storage 1420, removable and/or non-removable, are examples of computer-readable storage media (CRSM) as that term is used herein.

The data storage 1420 may store computer-executable code, instructions, or the like that may be loadable into the memory 1404 and executable by the processor(s) 1402 to cause the processor(s) 1402 to perform or initiate various operations. The data storage 1420 may additionally store data that may be copied to the memory 1404 for use by the processor(s) 1402 during the execution of the computer-executable instructions. Moreover, output data generated as a result of execution of the computer-executable instructions by the processor(s) 1402 may be stored initially in the memory 1404, and may ultimately be copied to the data storage 1420 for non-volatile storage.

More specifically, the data storage 1420 may store one or more operating systems (O/S) 1422; one or more database management systems (DBMS) 1424; and one or more program module(s), applications, engines, computer-executable code, scripts, or the like such as, for example, one or more awake state module(s) 1426, one or more communication module(s) 1428, one or more operation mode management module(s) 1430, and/or one or more speaker identification module(s) 1432. Some or all of these module(s) may be sub-module(s). Any of the components depicted as being stored in the data storage 1420 may include any combination of software, firmware, and/or hardware. The software and/or firmware may include computer-executable code, instructions, or the like that may be loaded into the memory 1404 for execution by one or more of the processor(s) 1402. Any of the components depicted as being stored in the data storage 1420 may support functionality described in reference to correspondingly named components earlier in this disclosure.

The data storage 1420 may further store various types of data utilized by components of the device 1400. Any data stored in the data storage 1420 may be loaded into the memory 1404 for use by the processor(s) 1402 in executing computer-executable code. In addition, any data depicted as being stored in the data storage 1420 may potentially be stored in one or more datastore(s) and may be accessed via the DBMS 1424 and loaded in the memory 1404 for use by the processor(s) 1402 in executing computer-executable code. The datastore(s) may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed datastores in which data is stored on more than one node of a computer network, peer-to-peer network datastores, or the like. In FIG. 14, the datastore(s) may include, for example, operating mode settings for various applications, authorized speaker or user data, docked operating mode settings, and other information.

The processor(s) 1402 may be configured to access the memory 1404 and execute computer-executable instructions loaded therein. For example, the processor(s) 1402 may be configured to execute computer-executable instructions of the various program module(s), applications, engines, or the like of the device 1400 to cause or facilitate various operations to be performed in accordance with one or more embodiments of the disclosure. The processor(s) 1402 may include any suitable processing unit capable of accepting data as input, processing the input data in accordance with stored computer-executable instructions, and generating output data. The processor(s) 1402 may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth. Further, the processor(s) 1402 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like. The microarchitecture design of the processor(s) 1402 may be capable of supporting any of a variety of instruction sets.

Referring now to functionality supported by the various program module(s) depicted in FIG. 14, the awake state module(s) 1426 may include computer-executable instructions, code, or the like that, responsive to execution by one or more of the processor(s) 1402, may perform functions including, but not limited to, determining whether a computer processor is being held in an awake or always-on state, determining whether a device component is being held in an awake state, determining whether an application is active or in a foreground of a computing environment, determining whether a display is being held in an awake state, determining active applications, and the like.

The communication module(s) 1428 may include computer-executable instructions, code, or the like that, responsive to execution by one or more of the processor(s) 1402, may perform functions including, but not limited to, communicating with one or more devices, for example, via wired or wireless communication, communicating with remote servers, communicating with remote datastores, sending or receiving voice data, communicating with cache memory data, and the like.

The operation mode management module(s) 1430 may include computer-executable instructions, code, or the like that, responsive to execution by one or more of the processor(s) 1402, may perform functions including, but not limited to, determining an active or current device operation mode, causing changes to device operation modes, canceling or deferring automatic changes to device operation modes, determining voice commands or voice inputs, and the like.

The speaker identification module(s) 1432 may include computer-executable instructions, code, or the like that, responsive to execution by one or more of the processor(s) 1402, may perform functions including, but not limited to, determining wake words, determining voice data or voice commands, identifying speakers of voice inputs, determining confidence scores, comparing attributes of voice input to stored data, and the like.

Referring now to other illustrative components depicted as being stored in the data storage 1420, the O/S 1422 may be loaded from the data storage 1420 into the memory 1404 and may provide an interface between other application software executing on the device 1400 and hardware resources of the device 1400. More specifically, the O/S 1422 may include a set of computer-executable instructions for managing hardware resources of the device 1400 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the O/S 1422 may control execution of the other program module(s) to dynamically enhance characters for content rendering. The O/S 1422 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.

The DBMS 1424 may be loaded into the memory 1404 and may support functionality for accessing, retrieving, storing, and/or manipulating data stored in the memory 1404 and/or data stored in the data storage 1420. The DBMS 1424 may use any of a variety of database models (e.g., relational model, object model, etc.) and may support any of a variety of query languages. The DBMS 1424 may access data represented in one or more data schemas and stored in any suitable data repository including, but not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed datastores in which data is stored on more than one node of a computer network, peer-to-peer network datastores, or the like. In those example embodiments in which the device 1400 is a mobile device, the DBMS 1424 may be any suitable lightweight DBMS optimized for performance on a mobile device.

Referring now to other illustrative components of the device 1400, the input/output (I/O) interface(s) 1406 may facilitate the receipt of input information by the device 1400 from one or more I/O devices as well as the output of information from the device 1400 to the one or more I/O devices. The I/O devices may include any of a variety of components such as a display or display screen having a touch surface or touchscreen; an audio output device for producing sound, such as a speaker; an audio capture device, such as a microphone; an image and/or video capture device, such as a camera; a haptic unit; and so forth. Any of these components may be integrated into the device 1400 or may be separate. The I/O devices may further include, for example, any number of peripheral devices such as data storage devices, printing devices, and so forth.

The I/O interface(s) 1406 may also include an interface for an external peripheral device connection such as universal serial bus (USB), FireWire, Thunderbolt, Ethernet port, or other connection protocol that may connect to one or more networks. The I/O interface(s) 1406 may also include a connection to one or more of the antenna(e) 1434 to connect to one or more networks via a wireless local area network (WLAN) (such as Wi-Fi) radio, Bluetooth, ZigBee, and/or a wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, ZigBee network, etc.

The device 1400 may further include one or more network interface(s) 1408 via which the device 1400 may communicate with any of a variety of other systems, platforms, networks, devices, and so forth. The network interface(s) 1408 may enable communication, for example, with one or more wireless routers, one or more host servers, one or more web servers, and the like via one or more networks.

The antenna(e) 1434 may include any suitable type of antenna depending, for example, on the communications protocols used to transmit or receive signals via the antenna(e) 1434. Non-limiting examples of suitable antennas may include directional antennas, non-directional antennas, dipole antennas, folded dipole antennas, patch antennas, multiple-input multiple-output (MIMO) antennas, or the like. The antenna(e) 1434 may be communicatively coupled to one or more transceivers 1412 or radio components to which or from which signals may be transmitted or received.

As previously described, the antenna(e) 1434 may include a cellular antenna configured to transmit or receive signals in accordance with established standards and protocols, such as Global System for Mobile Communications (GSM), 3G standards (e.g., Universal Mobile Telecommunications System (UMTS), Wideband Code Division Multiple Access (W-CDMA), CDMA2000, etc.), 4G standards (e.g., Long-Term Evolution (LTE), WiMax, etc.), direct satellite communications, or the like.

The antenna(e) 1434 may additionally, or alternatively, include a Wi-Fi antenna configured to transmit or receive signals in accordance with established standards and protocols, such as the IEEE 802.11 family of standards, including via 2.4 GHz channels (e.g., 802.11b, 802.11g, 802.11n), 5 GHz channels (e.g., 802.11n, 802.11ac), or 60 GHz channels (e.g., 802.11ad). In alternative example embodiments, the antenna(e) 1434 may be configured to transmit or receive radio frequency signals within any suitable frequency range forming part of the unlicensed portion of the radio spectrum.

The antenna(e) 1434 may additionally, or alternatively, include a GNSS antenna configured to receive GNSS signals from three or more GNSS satellites carrying time-position information to triangulate a position therefrom. Such a GNSS antenna may be configured to receive GNSS signals from any current or planned GNSS such as, for example, the Global Positioning System (GPS), the GLONASS System, the Compass Navigation System, the Galileo System, or the Indian Regional Navigational System.

The transceiver(s) 1412 may include any suitable radio component(s) for—in cooperation with the antenna(e) 1434—transmitting or receiving radio frequency (RF) signals in the bandwidth and/or channels corresponding to the communications protocols utilized by the device 1400 to communicate with other devices. The transceiver(s) 1412 may include hardware, software, and/or firmware for modulating, transmitting, or receiving—potentially in cooperation with any of antenna(e) 1434—communications signals according to any of the communications protocols discussed above including, but not limited to, one or more Wi-Fi and/or Wi-Fi direct protocols, as standardized by the IEEE 802.11 standards, one or more non-Wi-Fi protocols, or one or more cellular communications protocols or standards. The transceiver(s) 1412 may further include hardware, firmware, or software for receiving GNSS signals. The transceiver(s) 1412 may include any known receiver and baseband suitable for communicating via the communications protocols utilized by the device 1400. The transceiver(s) 1412 may further include a low noise amplifier (LNA), additional signal amplifiers, an analog-to-digital (A/D) converter, one or more buffers, a digital baseband, or the like.

The sensor(s)/sensor interface(s) 1410 may include or may be capable of interfacing with any suitable type of sensing device such as, for example, inertial sensors, force sensors, thermal sensors, and so forth. Example types of inertial sensors may include accelerometers (e.g., MEMS-based accelerometers), gyroscopes, and so forth.

The optional speaker(s) 1414 may be any device configured to generate audible sound. The optional microphone(s) 1416 may be any device configured to receive analog sound input or voice data.

It should be appreciated that the program module(s), applications, computer-executable instructions, code, or the like depicted in FIG. 14 as being stored in the data storage 1420 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple module(s) or performed by a different module. In addition, various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the device 1400, and/or hosted on other computing device(s) accessible via one or more networks, may be provided to support functionality provided by the program module(s), applications, or computer-executable code depicted in FIG. 14 and/or additional or alternate functionality. Further, functionality may be modularized differently such that processing described as being supported collectively by the collection of program module(s) depicted in FIG. 14 may be performed by a fewer or greater number of module(s), or functionality described as being supported by any particular module may be supported, at least in part, by another module. In addition, program module(s) that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer-to-peer model, and so forth. In addition, any of the functionality described as being supported by any of the program module(s) depicted in FIG. 14 may be implemented, at least partially, in hardware and/or firmware across any number of devices.

It should further be appreciated that the device 1400 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the device 1400 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program module(s) have been depicted and described as software module(s) stored in data storage 1420, it should be appreciated that functionality described as being supported by the program module(s) may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned module(s) may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other module(s). Further, one or more depicted module(s) may not be present in certain embodiments, while in other embodiments, additional module(s) not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain module(s) may be depicted and described as sub-module(s) of another module, in certain embodiments, such module(s) may be provided as independent module(s) or as sub-module(s) of other module(s).

The system may operate using various components as described in FIG. 15. The various components illustrated in FIG. 15 may be located on the same or different physical devices. Communication between various components illustrated in FIG. 15 may occur directly or across one or more network(s). The system of FIG. 15 may include one or more server(s) 1520 and one or more skill server(s) 1540 that may be in communication using one or more networks.

A device 1510 captures audio 1500 using an audio capture component, such as a microphone or array of microphones. The device 1510, using a wakeword detection component 1530, processes audio data corresponding to the audio 1500 to determine if a keyword (e.g., a wakeword) is detected in the audio data. Following detection of a wakeword, the device 1510 sends audio data 1512, corresponding to the audio 1500, to the one or more server(s) 1520.

Upon receipt by the server(s) 1520, the audio data 1512 may be sent to an orchestrator component 1570. The orchestrator component 1570 may include memory and logic that enables the orchestrator component 1570 to transmit various pieces and forms of data to various components of the system.

The orchestrator component 1570 sends the audio data 1512 to a speech processing component 1550. An ASR component 1552 of the speech processing component 1550 transcribes the audio data 1512 into one or more textual interpretations representing speech contained in the audio data 1512. The ASR component 1552 interprets the spoken utterance based on a similarity between the spoken utterance and pre-established language models. For example, the ASR component 1552 may compare the audio data 1512 with models for sounds (e.g., subword units such as phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance represented in the audio data 1512. The ASR component 1552 sends text data generated thereby to an NLU component 1554 of the speech processing component 1550. The text data sent from the ASR component 1552 to the NLU component 1554 may include a top-scoring textual interpretation of the audio data 1512 or may include an N-best list including a group of textual interpretations of the audio data 1512, and potentially their respective scores.

The NLU component 1554 attempts to make a semantic interpretation of the phrases or statements represented in the text data input therein. That is, the NLU component 1554 determines one or more meanings associated with the phrases or statements represented in the text data based on individual words represented in the text data. The NLU component 1554 interprets a text string to derive an intent of the user (e.g., an action that the user desires be performed) as well as pertinent pieces of information in the text data that allow a device (e.g., the device 1510, the server(s) 1520, the skill server(s) 1540, etc.) to complete the intent. For example, if the text data corresponds to “play music,” the NLU component 1554 may determine the user intended music to be output from one or more devices.
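
As a hedged illustration of that output, an intent-and-slot representation for the “play music” example might look like the following; the key names and confidence values are assumptions for the sketch only.

```python
# Illustrative NLU results; the field names are assumed, not prescribed by the disclosure.
nlu_result = {
    "intent": "PlayMusic",
    "slots": {},            # "play music" names no artist, album, or song
    "confidence": 0.92,
}

# A more specific utterance would fill slots, e.g. "play poker face by lady gaga".
nlu_result_specific = {
    "intent": "PlayMusic",
    "slots": {"ArtistName": "lady gaga", "SongName": "poker face"},
    "confidence": 0.95,
}
```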

The server(s) 1520 may include a user recognition component 1560. The user recognition component 1560 may determine a user that most likely spoke an input utterance, as explained below.

The server(s) 1520 may include a profile storage 1572. The profile storage 1572 may include a variety of information related to individual devices, groups of devices, individual users, groups of users, etc. that interact with the system, as described below.

The orchestrator component 1570 may send output from the NLU component 1554 (e.g., text data including tags attributing meaning to the words and phrases represented in the text data), and optionally output from the user recognition component 1560 and/or data from the profile storage 1572, to one or more speechlets 1590 and/or the one or more skill servers 1540 implementing one or more skills.

A “speechlet” may be software running on the server(s) 1520 that is akin to a software application running on a traditional desktop computer. That is, a speechlet 1590 may enable the server(s) 1520 to execute specific functionality in order to provide data or produce some other output requested by a user. The server(s) 1520 may be configured with more than one speechlet 1590. For example, a weather service speechlet may enable the server(s) 1520 to provide weather information, a car service speechlet may enable the server(s) 1520 to book a trip with respect to a taxi or ride sharing service, an order pizza speechlet may enable the server(s) 1520 to order a pizza with respect to a restaurant's online ordering system, etc. A speechlet may operate in conjunction between the server(s) 1520 and other devices such as a local device 1510 in order to complete certain functions. Inputs to the speechlet may come from speech processing interactions or through other interactions or input sources. In some embodiments, speechlets may send signals or data to client devices that cause the client device to activate a voice-forward operating mode or a tablet operating mode. A current operating mode of a client device may be stored at the server(s) 1520. In some embodiments, a tablet-management speechlet may be included and may send a directive or command to a client device, such as a tablet, that causes the device to activate or switch into certain operating modes.
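
As a minimal sketch, assuming a simple dict-based directive payload and device registry (neither of which is specified by the disclosure), a tablet-management speechlet handler that instructs a client device to switch operating modes might look like the following.

```python
# Hypothetical sketch of a tablet-management speechlet handler; the directive
# name, payload fields, and registry shape are illustrative assumptions.
def handle_mode_change_intent(device_id: str, target_mode: str, device_registry: dict) -> dict:
    """Build a directive that tells a client device to switch operating modes."""
    if target_mode not in ("voice_forward", "tablet"):
        raise ValueError(f"unsupported operating mode: {target_mode}")

    directive = {
        "device_id": device_id,
        "name": "SetOperatingMode",
        "payload": {"mode": target_mode},
    }
    # Track the device's current operating mode server-side, as described above.
    device_registry[device_id] = {"operating_mode": target_mode}
    return directive
```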

A speechlet may include a “skill.” A skill may be software running on a skill server(s) 1540 that is akin to an application. That is, a skill may enable the skill server(s) 1540 to execute specific functionality in order to provide data or produce some other output requested by a user. A skill server(s) 1540 may be configured with more than one skill. For example, a weather service skill may enable the skill server(s) 1540 to provide weather information to the server(s) 1520, a car service skill may enable the skill server(s) 1540 to book a trip with respect to a taxi or ride sharing service, an order pizza skill may enable the skill server(s) 1540 to order a pizza with respect to a restaurant's online ordering system, etc. A skill may operate in conjunction between the skill server(s) 1540 and other devices such as the server(s) 1520 or the local device 1510 in order to complete certain functions. Inputs to the skill may come from speech processing interactions or through other interactions or input sources. Skills may be associated with certain client devices while the client device is in a voice-forward mode. For example, while in a voice-forward mode, a client device may be associated with a music skill that can be used to cause playback of music using voice commands received at the client device.

The functions provided by one or more speechlets 1590 may overlap or be different from the functions provided by one or more skills. Speechlets 1590 may be implemented in some combination of hardware, software, firmware, etc.

The orchestrator component 1570 may choose which speechlet(s) 1590 and/or skill server(s) 1540 to send data to based on the output of the NLU component 1554. In an example, the orchestrator component 1570 may send data to a music playing speechlet(s) 1590 and/or skill server(s) 1540 when the NLU component 1554 outputs text data associated with a command to play music. In another example, the orchestrator component 1570 may send data to a weather speechlet(s) 1590 and/or skill server(s) 1540 when the NLU component 1554 outputs text data associated with a command to output weather information. In yet another example, the orchestrator component 1570 may send data to a search engine speechlet(s) 1590 and/or skill server(s) 1540 when the NLU component 1554 outputs text data associated with a command to obtain search results.

Speechlets 1590 and skill servers 1540 may output text data, which the orchestrator component 1570 may send to a text-to-speech (TTS) component 1592. The TTS component 1592 may synthesize speech corresponding to the text data input therein. The orchestrator component 1570 or another component of the server(s) 1520 may send audio data synthesized by the TTS component 1592 (or other output data from speechlet(s) 1590 or skill server(s) 1540) to the device 1510 (or another device including a speaker and associated with the same user ID or customer ID) for output to one or more users.

The TTS component 1592 may perform speech synthesis using one or more different methods. In one method of synthesis called unit selection, the TTS component 1592 matches text data against a database of recorded speech. Matching units are selected and concatenated together to form audio data. In another method of synthesis called parametric synthesis, the TTS component 1592 varies parameters such as frequency, volume, and noise to create an artificial speech waveform output. Parametric synthesis uses a computerized voice generator, sometimes called a vocoder.

The various components may exist in software, hardware, firmware, or some combination thereof.

The user recognition component 1560 may recognize one or more users using a variety of data. As illustrated in FIG. 15, the user recognition component 1560 may include one or more subcomponents including a vision component 1561, an audio component 1562, a biometric component 1563, a radio frequency (RF) component 1564, a machine learning (ML) component 1565, and a recognition confidence component 1566. In some instances, the user recognition component 1560 may monitor data and determinations from one or more subcomponents to determine an identity of one or more users in an environment. The user recognition component 1560 may output user recognition data 1580, which may include a user identifier associated with a user the system believes is interacting with the system. The user recognition data 1580 may be used to inform NLU component 1554 processes as well as processing performed by speechlets 1590, skill servers 1540, routing of output data, permission access to further information, etc.

The vision component 1561 may receive data from one or more sensors capable of providing images (e.g., cameras) or sensors indicating motion (e.g., motion sensors). The vision component 1561 can perform facial recognition or image analysis to determine an identity of a user and to associate that identity with a user profile associated with the user. In some instances, when a user is facing a camera, the vision component 1561 may perform facial recognition and identify the user with a high degree of confidence. In other instances, the vision component 1561 may have a low degree of confidence of an identity of a user, and the user recognition component 1560 may utilize determinations from additional components to determine an identity of a user. The vision component 1561 can be used in conjunction with other components to determine an identity of a user. For example, the user recognition component 1560 may use data from the vision component 1561 with data from the audio component 1562 to identify which user's face appears to be speaking at the same time audio is captured by the device the user is facing, for purposes of identifying the user who spoke an utterance.

The system may include biometric sensors that transmit data to the biometric component 1563. For example, the biometric component 1563 may receive data corresponding to fingerprints, iris or retina scans, thermal scans, weights of users, a size of a user, pressure (e.g., within floor sensors), etc., and may determine a biometric profile corresponding to a user. The biometric component 1563 may distinguish between a user and sound from a television, for example. Thus, the biometric component 1563 may incorporate biometric information into a confidence level for determining an identity of a user. Biometric information output by the biometric component 1563 can be associated with a specific user profile such that the biometric information uniquely identifies a user profile of a user.

The RF component 1564 may use RF localization to track devices that a user may carry or wear. For example, a user (and a user profile associated with the user) may be associated with a computing device. The computing device may emit RF signals (e.g., Wi-Fi, Bluetooth®, etc.). A device may detect the signal and indicate to the RF component 1564 the strength of the signal (e.g., as a received signal strength indication (RSSI)). The RF component 1564 may use the RSSI to determine an identity of a user (with an associated confidence level). In some instances, the RF component 1564 may determine that a received RF signal is associated with a mobile device that is associated with a particular user.

In some instances, a device 1510 may include some RF or other detection processing capabilities so that a user who speaks an utterance may scan, tap, or otherwise acknowledge his/her personal device (such as a phone) to the device 1510. In this manner, the user may “register” with the system for purposes of the system determining who spoke a particular utterance. Such a registration may occur prior to, during, or after speaking of an utterance.

The ML component 1565 may track the behavior of various users in the environment as a factor in determining a confidence level of the identity of the user. By way of example, a user may adhere to a regular schedule such that the user is outside the environment during the day (e.g., at work or at school). In this example, the ML component 1565 would factor past behavior and/or trends into determining the identity of the user that spoke an utterance to the system. Thus, the ML component 1565 may use historical data and/or usage patterns over time to increase or decrease a confidence level of an identity of a user.

In some instances, the recognition confidence component 1566 receives determinations from the various components and may determine a final confidence level associated with the identity of a user. In some instances, the confidence level may determine whether an action is performed. For example, if a user request includes a request to unlock a door, a confidence level may need to be above a threshold that may be higher than a confidence level needed to perform a user request associated with playing a playlist or resuming a location in an audiobook. The confidence level or other score data may be included in the user recognition data 1580.
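
A minimal sketch, assuming per-intent numeric thresholds (the intent names and values below are illustrative, not taken from the disclosure), of gating sensitive actions on a higher recognition confidence:

```python
# Illustrative thresholds mirroring the door-unlock vs. playlist example above.
CONFIDENCE_THRESHOLDS = {
    "UnlockDoor": 0.95,       # sensitive action: require high confidence
    "PlayPlaylist": 0.50,     # low-risk action: a lower bar suffices
    "ResumeAudiobook": 0.50,
}
DEFAULT_THRESHOLD = 0.70

def action_permitted(intent: str, recognition_confidence: float) -> bool:
    """Return True if the recognized user's confidence clears the intent's bar."""
    return recognition_confidence >= CONFIDENCE_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
```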

The audio component 1562 may receive data from one or more sensors capable of providing an audio signal (e.g., the device 1510, one or more microphones, etc.) to facilitate recognizing a user. The audio component 1562 may perform audio recognition on an audio signal to determine an identity of the user and an associated user profile. In some instances, aspects of the server(s) 1520 may be configured at a computing device (e.g., a local server) within the environment. Thus, in some instances, the audio component 1562 operating on a computing device in the environment may analyze all sound within the environment (e.g., without requiring a wake word) to facilitate recognizing a user. In some instances, the audio component 1562 may perform voice recognition to determine an identity of a user.

The audio component 1562 may also determine whether a user is a child based on audio characteristics. The audio component 1562 may include a model trained with respect to speech characteristics common to children. Using the trained model, the audio component 1562 may make a binary determination regarding whether the user that spoke the command is a child. The trained model(s) may determine a child is speaking based on acoustic properties of audio (e.g., pitch, prosody, energy) as well as other data/characteristics (e.g., vocabulary, sentence structure, direction from which audio of an utterance is received (since children are shorter than adults)).

Child detection can be performed independently of user identity. For example, the system may use user recognition techniques and not be able to identify the specific speaking user, but may still be able to tell that the speaking user is a child or non-adult.

The audio component 1562 may also perform user identification based on information relating to a spoken utterance input into the system for speech processing. For example, the audio component 1562 may take as input the audio data 1512 and/or output data from the ASR component 1552. The audio component 1562 may determine scores indicating whether the command originated from particular users. For example, a first score may indicate a likelihood that the command originated from a first user, a second score may indicate a likelihood that the command originated from a second user, etc. The audio component 1562 may perform user recognition by comparing speech characteristics in the audio data 1512 to stored speech characteristics of users.

FIG. 16 illustrates the audio component 1562 of the user recognition component 1560 performing user recognition using audio data, for example input audio data 1512 corresponding to an input utterance. In addition to outputting text data as described above, the ASR component 1552 may also output ASR confidence data 1660, which is passed to the user recognition component 1560. The audio component 1562 performs user recognition using various data including the audio data 1512, training data 1610 corresponding to sample audio data corresponding to known users, the ASR confidence data 1660, and secondary data 1650. The audio component 1562 may output user recognition confidence data 1640 that reflects a certain confidence that the input utterance was spoken by one or more particular users. The user recognition confidence data 1640 may include an indicator of a verified user (such as a user ID corresponding to the speaker of the utterance) along with a confidence value corresponding to the user ID, such as a numeric value or binned value as discussed below. The user recognition confidence data 1640 may be used by various components, including other components of the user recognition component 1560, to recognize a user.

The training data 1610 may be stored in a user recognition data storage 1600. The user recognition data storage 1600 may be stored by the server(s) 1520, or may be a separate device. Further, the user recognition data storage 1600 may be part of a user profile in the profile storage 1572. The user recognition data storage 1600 may be a cloud-based storage. The training data 1610 stored in the user recognition data storage 1600 may be stored as waveforms and/or corresponding features/vectors. The training data 1610 may correspond to data from various audio samples, each audio sample associated with a known user and/or user identity. The audio samples may correspond to voice profile data for one or more users. For example, each user known to the system may be associated with some set of training data 1610/voice profile data for the known user. Thus, the training data 1610 may include a biometric representation of a user's voice. The audio component 1562 may then use the training data 1610 to compare against incoming audio data 1512 to determine the identity of a user speaking an utterance. The training data 1610 stored in the user recognition data storage 1600 may thus be associated with multiple users of multiple devices. Thus, the training data 1610 stored in the user recognition data storage 1600 may be associated with both a user that spoke the respective utterance, as well as the device 1510 that captured the respective utterance.

To perform user recognition, the audio component 1562 may determine the device 1510 from which the audio data 1512 originated. For example, the audio data 1512 may include a tag or other metadata indicating the device 1510. Either the device 1510 or the server(s) 1520 may tag the audio data 1512 as such. The user recognition component 1560 may send a signal to the user recognition data storage 1600, with the signal requesting only training data 1610 associated with known users of the device 1510 from which the audio data 1512 originated. This may include accessing a user profile(s) associated with the device 1510 and then only inputting training data 1610 associated with users corresponding to the user profile(s) of the device 1510. This limits the universe of possible training data the audio component 1562 should consider at runtime when recognizing a user and thus decreases the amount of time to perform user recognition by decreasing the amount of training data 1610 needed to be processed. Alternatively, the user recognition component 1560 may access all (or some other subset of) training data 1610 available to the system. Alternatively, the audio component 1562 may access a subset of training data 1610 of users potentially within the environment of the device 1510 from which the audio data 1512 originated, as may otherwise have been determined by the user recognition component 1560.
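
A minimal sketch, assuming simple dictionary-backed profile and training-data stores (both hypothetical), of limiting training data to users associated with the originating device:

```python
# Hypothetical dict-backed stores; their shapes are assumptions for illustration.
def training_data_for_device(device_id: str,
                             device_profiles: dict,
                             training_store: dict) -> dict:
    """Return only voice-profile training data for known users of this device."""
    user_ids = device_profiles.get(device_id, {}).get("user_ids", [])
    return {uid: training_store[uid] for uid in user_ids if uid in training_store}

# Example: a device associated with two household members.
device_profiles = {"tablet-01": {"user_ids": ["userA", "userB"]}}
training_store = {"userA": [0.1, 0.3], "userB": [0.4, 0.2], "userC": [0.9, 0.7]}
subset = training_data_for_device("tablet-01", device_profiles, training_store)
# subset contains only userA and userB, shrinking the comparison set at runtime.
```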

If the audio component 1562 receives training data 1610 as an audio waveform, the audio component 1562 may determine features/vectors of the waveform(s) or otherwise convert the waveform into a data format that can be used by the audio component 1562 to actually perform the user recognition. The audio component 1562 may then identify the user that spoke the utterance in the audio data 1512 by comparing features/vectors of the audio data 1512 to training features/vectors (either received from the user recognition data storage 1600 or determined from training data 1610 received from the user recognition data storage 1600).

The audio component 1562 may include a scoring component 1620 which determines respective scores indicating whether the input utterance (represented by the audio data 1512) was spoken by particular users (represented by the training data 1610). The audio component 1562 may also include a confidence component 1630 that determines an overall confidence as to the accuracy of the user recognition operations (such as those of the scoring component 1620) and/or an individual confidence for each user potentially identified by the scoring component 1620. The output from the scoring component 1620 may include scores for all users with respect to which user recognition was performed (e.g., all users associated with the device 1510). For example, the output may include a first score for a first user, a second score for a second user, a third score for a third user, etc. Although illustrated as two separate components, the scoring component 1620 and confidence component 1630 may be combined into a single component or may be separated into more than two components.

The scoring component 1620 and confidence component 1630 may implement one or more trained machine learning models (such as neural networks, classifiers, etc.) as known in the art. For example, the scoring component 1620 may use probabilistic linear discriminant analysis (PLDA) techniques. PLDA scoring determines how likely it is that an input audio data feature vector corresponds to a particular training data feature vector for a particular user. The PLDA scoring may generate similarity scores for each training feature vector considered and may output the list of scores and users and/or the user ID of the speaker whose training data feature vector most closely corresponds to the input audio data feature vector. The scoring component 1620 may also use other techniques such as GMMs, generative Bayesian models, or the like, to determine similarity scores.
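
PLDA itself relies on trained within-speaker and between-speaker models, so the following is only a simplified stand-in: it scores each known user by cosine similarity between an utterance feature vector and that user's training vector, to illustrate the shape of the scoring step rather than the actual PLDA math.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two feature vectors (a simplified stand-in for PLDA)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_users(utterance_vector: list[float],
                training_vectors: dict[str, list[float]]) -> dict[str, float]:
    """Return a similarity score per known user, with a higher score meaning a better match."""
    return {user_id: cosine_similarity(utterance_vector, vec)
            for user_id, vec in training_vectors.items()}
```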

The confidence component 1630 may input various data including the ASR confidence data 1660, utterance length (e.g., number of frames or time of the utterance), audio condition/quality data (such as signal-to-interference data or other metric data), fingerprint data, image data, or other factors to consider how confident the audio component 1562 is with regard to the scores linking users to the input utterance. The confidence component 1630 may also consider the similarity scores and user IDs output by the scoring component 1620. Thus, the confidence component 1630 may determine that a lower ASR confidence represented in the ASR confidence data 1660, or poor input audio quality, or other factors, may result in a lower confidence of the audio component 1562, whereas a higher ASR confidence represented in the ASR confidence data 1660, or better input audio quality, or other factors, may result in a higher confidence of the audio component 1562. Precise determination of the confidence may depend on configuration and training of the confidence component 1630 and the models used therein. The confidence component 1630 may operate using a number of different machine learning models/techniques such as GMM, neural networks, etc. For example, the confidence component 1630 may be a classifier configured to map a score output by the scoring component 1620 to a confidence.

The audio component 1562 may output user recognition confidence data 1640 specific to a single user, or to multiple users in the form of an N-best list. For example, the audio component 1562 may output user recognition confidence data 1640 with respect to each user indicated in the profile associated with the device 1510 from which the audio data 1512 was received. The audio component 1562 may also output user recognition confidence data 1640 with respect to each user potentially in the location of the device 1510 from which the audio data 1512 was received.

The user recognition confidence data 1640 may include particular scores (e.g., 0.0-1.0, 0-1000, or whatever scale the system is configured to operate). Thus, the system may output an N-best list of potential users with confidence scores (e.g., John—0.2, Jane—0.8). Alternatively or in addition, the user recognition confidence data 1640 may include binned recognition indicators. For example, a computed recognition score of a first range (e.g., 0.0-0.33) may be output as “low,” a computed recognition score of a second range (e.g., 0.34-0.66) may be output as “medium,” and a computed recognition score of a third range (e.g., 0.67-1.0) may be output as “high.” Thus, the system may output an N-best list of potential users with binned scores (e.g., John—low, Jane—high). Combined binned and confidence score outputs are also possible. Rather than a list of users and their respective scores and/or bins, the user recognition confidence data 1640 may only include information related to the top scoring user as determined by the audio component 1562. The scores and bins may be based on information determined by the confidence component 1630. The audio component 1562 may also output a confidence value that the scores/bins are correct, where the confidence value indicates how confident the audio component 1562 is in the output results. This confidence value may be determined by the confidence component 1630.
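
A minimal sketch of the binning described above, using the example range boundaries from the text (the function and variable names are assumptions):

```python
def bin_score(score: float) -> str:
    """Map a raw recognition score onto the low/medium/high bins described above."""
    if score <= 0.33:
        return "low"
    if score <= 0.66:
        return "medium"
    return "high"

def binned_n_best(scores: dict[str, float]) -> list[tuple[str, str]]:
    """Rank users by score and attach bins, e.g. [("Jane", "high"), ("John", "low")]."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(user, bin_score(score)) for user, score in ranked]
```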

The confidence component 1630 may determine individual user confidences and differences between user confidences when determining the user recognition confidence data 1640. For example, if a difference between a first user's confidence score and a second user's confidence score is large, and the first user's confidence score is above a threshold, then the audio component 1562 is able to recognize the first user as the user that spoke the utterance with a much higher confidence than if the difference between the users' confidences were smaller.

The audio component 1562 may perform certain thresholding to avoid incorrect user recognition results being output. For example, the audio component 1562 may compare a confidence score output by the confidence component 1630 to a confidence threshold. If the confidence score is not above the confidence threshold (for example, a confidence of “medium” or higher), the audio component 1562 may not output user recognition confidence data 1640, or may only include in that data 1640 an indication that a user speaking the utterance could not be verified. Further, the audio component 1562 may not output user recognition confidence data 1640 until enough input audio data 1512 is accumulated and processed to verify the user above a threshold confidence. Thus, the audio component 1562 may wait until a sufficient threshold quantity of audio data 1512 of the utterance has been processed before outputting user recognition confidence data 1640. The quantity of received audio data 1512 may also be considered by the confidence component 1630.

The audio component 1562 may be defaulted to output binned (e.g., low, medium, high) user recognition confidence data 1640. However, such output may be problematic from the speechlet(s) 1590 and skill server(s) 1540 perspectives. For example, if the audio component 1562 computes a single binned confidence for multiple users, a speechlet(s) 1590/skill server(s) 1540 may not be able to determine which user to select content for. In this situation, the audio component 1562 may be configured to override its default setting and output user recognition confidence data 1640 including values (e.g., 0.0-1.0) associated with the users associated with the same binned confidence. This enables the speechlet(s) 1590/skill server(s) 1540 to select content associated with the user associated with the highest confidence value. The user recognition confidence data 1640 may also include the user IDs corresponding to the potential user(s) who spoke the utterance.

The user recognition component 1560 may combine data from components to determine the identity of a particular user. As part of its audio-based user recognition operations, the audio component 1562 may use secondary data 1650 to inform user recognition processing. Thus, a trained model or other component of the audio component 1562 may be trained to take secondary data 1650 as an input feature when performing recognition. Secondary data 1650 may include a wide variety of data types depending on system configuration and may be made available from other sensors, devices, or storage such as user profile data, etc. The secondary data 1650 may include a time of day at which the audio data 1512 was captured, a day of a week in which the audio data 1512 was captured, the text data output by the ASR component 1552, NLU results data, and/or other data.

In one example, secondary data 1650 may include image data or video data. For example, facial recognition may be performed on image data or video data received corresponding to the received audio data 1512. Facial recognition may be performed by the vision component 1561, or by another component of the server(s) 1520. The output of the facial recognition process may be used by the audio component 1562. That is, facial recognition output data may be used in conjunction with the comparison of the features/vectors of the audio data 1512 and training data 1610 to perform more accurate user recognition.

The secondary data 1650 may also include location data of the device 1510. The location data may be specific to a building within which the device 1510 is located. For example, if the device 1510 is located in user A's bedroom, such location may increase user recognition confidence data associated with user A, but decrease user recognition confidence data associated with user B.

The secondary data 1650 may also include data related to the profile of the device 1510. For example, the secondary data 1650 may further include type data indicating a type of the device 1510. Different types of devices may include, for example, a smart watch, a smart phone, a tablet computer, and a vehicle. The type of device may be indicated in the profile associated with the device. For example, if the device 1510 from which the audio data 1512 was received is a smart watch or vehicle belonging to user A, the fact that the device 1510 belongs to user A may increase user recognition confidence data associated with user A, but decrease user recognition confidence data associated with user B. Alternatively, if the device 1510 from which the audio data 1512 was received is a public or semi-public device, the system may use information about the location of the device to cross-check other potential user locating information (such as calendar data, etc.) to potentially narrow the potential users to be recognized with respect to the audio data 1512.

The secondary data 1650 may additionally include geographic coordinate data associated with the device 1510. For example, a profile associated with a vehicle may indicate multiple users (e.g., user A and user B). The vehicle may include a global positioning system (GPS) indicating latitude and longitude coordinates of the vehicle when the audio data 1512 is captured by the vehicle. As such, if the vehicle is located at a coordinate corresponding to a work location/building of user A, such may increase user recognition confidence data associated with user A, but decrease user recognition confidence data of all other users indicated in the profile associated with the vehicle. Global coordinates and associated locations (e.g., work, home, etc.) may be indicated in a user profile associated with the device 1510. The global coordinates and associated locations may be associated with respective users in the user profile storage 1572.

The secondary data 1650 may also include other data/signals about activity of a particular user that may be useful in performing user recognition of an input utterance. For example, if a user has recently entered a code to disable a home security alarm, and the utterance corresponds to a device at the home, signals from the home security alarm about the disabling user, time of disabling, etc. may be reflected in the secondary data 1650 and considered by the audio component 1562. If a mobile device (such as a phone, Tile, dongle, or other device) known to be associated with a particular user is detected proximate to (for example physically close to, connected to the same WiFi network as, or otherwise nearby) the device 1510, this may be reflected in the secondary data 1650 and considered by the audio component 1562.

The user recognition confidence data 1640 output by the audio component 1562 may be used by other components of the user recognition component 1560 and/or may be sent to one or more speechlets 1590, skill servers 1540, the orchestrator component 1570, or to other components. The speechlet(s) 1590/skill server(s) 1540 that receives the NLU results and the user recognition confidence score data 1640 (or other user recognition results as output by the user recognition component 1560) may be determined by the server(s) 1520 as corresponding to content responsive to the utterance in the audio data 1512. For example, if the audio data 1512 includes the utterance “Play my music,” the NLU results and user recognition confidence data 1640 (or other output user recognition data) may be sent to a music playing speechlet(s) 1590/skill server(s) 1540.

FIG. 17 illustrates how NLU processing is performed on text data. Generally, the NLU component 1554 attempts to make a semantic interpretation of text represented in text data (e.g., ASR results output by the ASR component 1552). That is, the NLU component 1554 determines the meaning behind the text represented in text data based on the individual words. The NLU component 1554 interprets text to derive an intent or a desired action from an utterance as well as the pertinent pieces of information in the text that allow a device (e.g., device 1510, server(s) 1520, speechlet(s) 1590, skill server(s) 1540) to complete that action.

The NLU component 1554 may process text data including several textual interpretations of a single utterance. For example, if the ASR component 1552 outputs ASR results including an N-best list of textual interpretations, the NLU component 1554 may process the text data with respect to all (or a portion of) the textual interpretations represented therein.

The NLU component 1554 may include one or more recognizers 1720. Each recognizer 1720 may be associated with a different speechlet 1590. The NLU component 1554 may determine a speechlet 1590 potentially associated with a textual interpretation represented in text data input thereto in order to determine the proper recognizer 1720 to process the textual interpretation. The NLU component 1554 may determine a single textual interpretation is potentially associated with more than one speechlet 1590. Multiple recognizers 1720 may be functionally linked (e.g., a telephony/communications recognizer and a calendaring recognizer may utilize data from the same contact list).

If the NLU component 1554 determines a specific textual interpretation is potentially associated with multiple speechlets 1590, the recognizers 1720 associated with the speechlets 1590 may process the specific textual interpretation in parallel. For example, if a specific textual interpretation potentially implicates both a communications speechlet and a music speechlet, a recognizer associated with the communications speechlet may process the textual interpretation in parallel, or substantially in parallel, with a recognizer associated with the music speechlet processing the textual interpretation. The output generated by each recognizer may be scored, with the overall highest scored output from all recognizers ordinarily being selected to be the correct result.

The NLU component 1554 may communicate with various storages to determine the potential speechlet(s) associated with a textual interpretation. The NLU component 1554 may communicate with an NLU storage 1740, which includes databases of devices (1746) identifying speechlets associated with specific devices. For example, the device 1510 may be associated with speechlets for music, calendaring, contact lists, device-specific communications, etc. In addition, the NLU component 1554 may communicate with an entity library 1730, which includes database entries about specific services on a specific device, either indexed by device ID, user ID, or group user ID, or some other indicator.

Each recognizer 1720 may include a named entity recognition (NER) component 1722. The NER component 1722 attempts to identify grammars and lexical information that may be used to construe meaning with respect to a textual interpretation input therein. The NER component 1722 identifies portions of text represented in text data input into the NLU component 1554 that correspond to a named entity that may be recognizable by the system. The NER component 1722 (or other component of the NLU component 1554) may also determine whether a word refers to an entity that is not explicitly mentioned in the utterance text, for example “him,” “her,” “it” or other anaphora, exophora, or the like.

Each recognizer 1720, and more specifically each NER component 1722, may be associated with a particular grammar model and/or database 1748, a particular set of intents/actions 1742, and a particular personalized lexicon (gazetteer) 1728. Each gazetteer may include speechlet-indexed lexical information associated with a particular user and/or device. For example, Gazetteer A includes speechlet-indexed lexical information 1728. A user's music speechlet lexical information might include album titles, artist names, and song names, for example, whereas a user's contact-list lexical information might include the names of contacts. Since every user's music collection and contact list is presumably different, this personalized information improves entity resolution.

An NER component 1722 applies grammar models 1748 and lexical information 1728 associated with the speechlet (associated with the recognizer 1720 implementing the NER component 1722) to determine a mention of one or more entities in a textual interpretation input therein. In this manner, the NER component 1722 identifies “slots” (i.e., particular words in a textual interpretation) that may be needed for later command processing. The NER component 1722 may also label each slot with a type of varying levels of specificity (e.g., noun, place, city, artist name, song name, etc.).

Each grammar model 1748 includes the names of entities (i.e., nouns) commonly found in speech about the particular speechlet to which the grammar model 1748 relates, whereas the lexical information 1728 is personalized to the user(s) and/or the device 1510 from which the audio data 1512 originated. For example, a grammar model 1748 associated with a shopping speechlet may include a database of words commonly used when people discuss shopping.

A downstream process called named entity resolution actually links a portion of text to an actual specific entity known to the system. To perform named entity resolution, the NLU component 1554 may utilize gazetteer information stored in an entity library storage 1730. The gazetteer information may be used to match text represented in text data output by the ASR component 1552 with different entities, such as song titles, contact names, etc. Gazetteers may be linked to users (e.g., a particular gazetteer may be associated with a specific user's music collection), may be linked to certain speechlet categories (e.g., shopping, music, video, communications, etc.), or may be organized in a variety of other ways.

Each recognizer 1720 may also include an intent classification (IC) component 1724. The IC component 1724 parses an input textual interpretation to determine an intent(s) of the speechlet associated with the recognizer 1720 that potentially corresponds to the textual interpretation. An intent corresponds to an action to be performed that is responsive to the command represented by the textual interpretation. The IC component 1724 may communicate with a database 1742 of words linked to intents. For example, a music intent database may link words and phrases such as “quiet,” “volume off,” and “mute” to a “mute” intent. The IC component 1724 identifies potential intents by comparing words in the textual interpretation to the words and phrases in an intents database 1742 associated with the speechlet that is associated with the recognizer 1720 implementing the IC component 1724.

The intents identifiable by a specific IC component 1724 are linked to speechlet-specific (i.e., the speechlet associated with the recognizer 1720 implementing the IC component 1724) grammar frameworks 1748 with “slots” to be filled. Each slot of a grammar framework 1748 corresponds to a portion of the textual interpretation that the system believes corresponds to an entity. For example, a grammar framework 1748 corresponding to a <PlayMusic> intent may correspond to textual interpretation sentence structures such as “Play {Artist Name},” “Play {Album Name},” “Play {Song name},” “Play {Song name} by {Artist Name},” etc. However, to make resolution more flexible, grammar frameworks 1748 may not be structured as sentences, but rather based on associating slots with grammatical tags.

For example, an NER component 1722 may parse a textual interpretation to identify words as subject, object, verb, preposition, etc. based on grammar rules and/or models prior to recognizing named entities in the textual interpretation. An IC component 1724 (implemented by the same recognizer 1720 as the NER component 1722) may use the identified verb to identify an intent. The NER component 1722 may then determine a grammar model 1748 associated with the identified intent. For example, a grammar model 1748 for an intent corresponding to <PlayMusic> may specify a list of slots applicable to play the identified “object” and any object modifier (e.g., a prepositional phrase), such as {Artist Name}, {Album Name}, {Song name}, etc. The NER component 1722 may then search corresponding fields in a lexicon 1728 associated with the speechlet associated with the recognizer 1720 implementing the NER component 1722, attempting to match words and phrases in the textual interpretation the NER component 1722 previously tagged as a grammatical object or object modifier with those identified in the lexicon 1728.
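
A minimal sketch of that sequence: identify the verb, map it to an intent, then try to match tagged objects against the slots named by that intent's framework. The verb table, slot lists, and lexicon contents are illustrative assumptions, not the disclosure's actual data.

```python
# Hypothetical verb-to-intent table, slot lists, and lexicon for illustration only.
VERB_TO_INTENT = {"play": "PlayMusic"}
INTENT_SLOTS = {"PlayMusic": ["ArtistName", "AlbumName", "SongName"]}
LEXICON = {
    "ArtistName": {"the rolling stones", "lady gaga"},
    "SongName": {"mother's little helper", "poker face"},
    "AlbumName": set(),
}

def interpret(verb: str, tagged_objects: list[str]) -> dict:
    """Map the verb to an intent, then fill slots by matching objects against the lexicon."""
    intent = VERB_TO_INTENT.get(verb.lower())
    slots = {}
    for slot in INTENT_SLOTS.get(intent, []):
        for text in tagged_objects:
            if text.lower() in LEXICON.get(slot, set()):
                slots[slot] = text
    return {"intent": intent, "slots": slots}

# interpret("Play", ["mother's little helper", "the rolling stones"])
# -> {"intent": "PlayMusic", "slots": {"ArtistName": "the rolling stones",
#                                      "SongName": "mother's little helper"}}
```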

An NER component 1722 may perform semantic tagging, which is the labeling of a word or combination of words according to their type/semantic meaning. An NER component 1722 may parse a textual interpretation using heuristic grammar rules, or a model may be constructed using techniques such as hidden Markov models, maximum entropy models, log linear models, conditional random fields (CRF), and the like. For example, an NER component 1722 implemented by a music speechlet recognizer 1720 may parse and tag a textual interpretation corresponding to “play mother's little helper by the rolling stones” as {Verb}: “Play,” {Object}: “mother's little helper,” {Object Preposition}: “by,” and {Object Modifier}: “the rolling stones.”

The NER component 1722 identifies “Play” as a verb based on a word database associated with the music speechlet, which an IC component 1724 (also implemented by the music speechlet recognizer 1720) may determine corresponds to a <PlayMusic> intent. At this stage, no determination has been made as to the meaning of “mother's little helper” and “the rolling stones,” but based on grammar rules and models, the NER component 1722 has determined that the text of these phrases relates to the grammatical object (i.e., entity) of the textual interpretation.

The frameworks linked to the intent are then used to determine what database fields should be searched to determine the meaning of these phrases, such as searching a user's gazetteer for similarity with the framework slots. For example, a framework for a <PlayMusic> intent might indicate to attempt to resolve the identified object based on {Artist Name}, {Album Name}, and {Song name}, and another framework for the same intent might indicate to attempt to resolve the object modifier based on {Artist Name}, and resolve the object based on {Album Name} and {Song Name} linked to the identified {Artist Name}. If the search of the gazetteer does not resolve a slot/field using gazetteer information, the NER component 1722 may search the database of generic words associated with the speechlet (in the knowledge base 1726). For example, if the textual interpretation was “play songs by the rolling stones,” after failing to determine an album name or song name called “songs” by “the rolling stones,” the NER component 1722 may search the speechlet vocabulary for the word “songs.” In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.
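
The fallback described above can be sketched as a two-step lookup, trying the user's gazetteer first and the speechlet's generic vocabulary second; both collections below are hypothetical placeholders.

```python
def resolve_slot(text: str, gazetteer: set[str], generic_vocabulary: set[str]) -> str | None:
    """Resolve a slot value against personalized data first, then generic speechlet words."""
    lowered = text.lower()
    if lowered in gazetteer:
        return lowered            # resolved using the user's personalized gazetteer
    if lowered in generic_vocabulary:
        return lowered            # resolved using the speechlet's generic vocabulary
    return None                   # slot left unresolved for downstream handling
```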

The NLU component 1554 may tag the textual interpretation to attribute meaning to the textual interpretation. For example, the NLU component 1554 may tag “play mother's little helper by the rolling stones” as: {intent} <PlayMusic>, {artist name} rolling stones, {media type} SONG, and {song title} mother's little helper. For further example, the NLU component 1554 may tag “play songs by the rolling stones” as: {intent} <PlayMusic>, {artist name} rolling stones, and {media type} SONG.

Certain recognizers 1720 may only be authorized to operate for certain users. For example, some recognizers 1720 may only be authorized to operate for adult users (e.g., users of eighteen years of age or older). The NLU component 1554 may use some combination of user recognition data 1580 and user profile data to confirm the user's identity/type. Based thereon, the NLU component 1554 may determine which recognizers 1720 may operate with respect to input text data (i.e., ASR results).

Each recognizer 1720 may output data corresponding to a single textual interpretation or to an N-best list of textual interpretations. The NLU component 1554 may compile the output data of the recognizers 1720 into a single N-best list, and may send N-best list data 1810 (representing the N-best list) to a pruning component 1820 (as illustrated in FIG. 18). The tagged textual interpretations in the N-best list data 1810 may each be associated with a respective score indicating a likelihood that the tagged textual interpretation corresponds to the speechlet associated with the recognizer 1720 from which the tagged textual interpretation was output. For example, the N-best list data 1810 may be represented as:

[0.95] Intent: <PlayMusic> ArtistName: Lady Gaga SongName: Poker Face

[0.70] Intent: <PlayVideo> ArtistName: Lady Gaga VideoName: Poker Face

[0.01] Intent: <PlayMusic> ArtistName: Lady Gaga AlbumName: Poker Face

[0.01] Intent: <PlayMusic> SongName: Pokerface

The pruning component 1820 creates a new, shorter N-best list (i.e., represented in N-best list data 1840 discussed below) based on the N-best list data 1810. The pruning component 1820 may sort the tagged textual interpretations represented in the N-best list data 1810 according to their respective scores.

The pruning component 1820 may perform score thresholding with respect to the N-best list data 1810. For example, the pruning component 1820 may select textual interpretations represented in the N-best list data 1810 associated with a score satisfying (e.g., meeting and/or exceeding) a score threshold. The pruning component 1820 may also or alternatively perform thresholding on the number of textual interpretations. For example, the pruning component 1820 may select the top scoring textual interpretation(s) associated with each different category of speechlet (e.g., music, shopping, communications, etc.) represented in the N-best list data 1810, with the new N-best list data 1840 including a total number of textual interpretations meeting or falling below a threshold number of textual interpretations. The purpose of the pruning component 1820 is to create a new list of top scoring textual interpretations so that downstream (more resource intensive) processes may only operate on the tagged textual interpretations that most likely correspond to the command input to the system.
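
A minimal sketch of the pruning step, under the assumptions that each tagged interpretation carries a numeric score and that the threshold and list-length cap are configurable (the values below are illustrative):

```python
def prune(n_best: list[dict], score_threshold: float = 0.5, max_items: int = 3) -> list[dict]:
    """Keep interpretations whose score clears the threshold, then cap the list length."""
    kept = [item for item in n_best if item["score"] >= score_threshold]
    kept.sort(key=lambda item: item["score"], reverse=True)
    return kept[:max_items]

# Using the example N-best list above, only the two top-scoring interpretations survive.
n_best = [
    {"score": 0.95, "intent": "PlayMusic", "SongName": "Poker Face"},
    {"score": 0.70, "intent": "PlayVideo", "VideoName": "Poker Face"},
    {"score": 0.01, "intent": "PlayMusic", "AlbumName": "Poker Face"},
]
pruned = prune(n_best)
```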

The NLU component 1554 may also include a light slot filler component 1830. The light slot filler component 1830 can take text from slots represented in the textual interpretation(s) output by the pruning component 1820 and alter it to make the text more easily processed by downstream components. The light slot filler component 1830 may perform low latency operations that do not involve heavy operations such as reference to a knowledge base. The purpose of the light slot filler component 1830 is to replace words with other words or values that may be more easily understood by downstream components. For example, if a textual interpretation includes the word “tomorrow,” the light slot filler component 1830 may replace the word “tomorrow” with an actual date for purposes of downstream processing. Similarly, the light slot filler component 1830 may replace the word “CD” with “album” or the words “compact disc.” The replaced words are then included in the N-best list data 1840.
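
A sketch of that idea, mirroring the “tomorrow” and “CD” examples; the replacement table is an illustrative assumption and a real component would cover far more cases.

```python
from datetime import date, timedelta

def fill_slots_lightly(slot_text: str, today: date) -> str:
    """Replace words in slot text with values downstream components handle more easily."""
    replacements = {
        "tomorrow": (today + timedelta(days=1)).isoformat(),  # e.g. "2018-03-15"
        "cd": "album",
    }
    return " ".join(replacements.get(word.lower(), word) for word in slot_text.split())

# fill_slots_lightly("play my cd tomorrow", date(2018, 3, 14))
# -> "play my album 2018-03-15"
```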

The NLU component 1554 sends the N-best list data 1840 to an entity resolution component 1850. The entity resolution component 1850 can apply rules or other instructions to standardize labels or tokens from previous stages into an intent/slot representation. The precise transformation may depend on the speechlet (e.g., for a travel speechlet, the entity resolution component 1850 may transform a text mention of “Atlanta airport” to the standard ATL three-letter code referring to the airport). The entity resolution component 1850 can refer to an authority source (e.g., a knowledge base) that is used to specifically identify the precise entity referred to in each slot of each textual interpretation represented in the N-best list data 1840. Specific intent/slot combinations may also be tied to a particular source, which may then be used to resolve the text. In the example “play songs by the stones,” the entity resolution component 1850 may reference a personal music catalog, Amazon Music account, user profile (described herein), or the like. The entity resolution component 1850 may output data including an altered N-best list that is based on the N-best list represented in the N-best list data 1840, but also includes more detailed information (e.g., entity IDs) about the specific entities mentioned in the slots and/or more detailed slot data that can eventually be used by a speechlet(s) 1590, which may be incorporated into the server(s) 1520 components or pipeline or may be on a separate device(s) (e.g., a skill server(s) 1540) in communication with the server(s) 1520. The NLU component 1554 may include multiple entity resolution components 1850 and each entity resolution component 1850 may be specific to one or more speechlets.
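
A minimal sketch of resolving a text mention against an authority source, following the “Atlanta airport” to ATL example; the lookup table is a stand-in for a real knowledge base.

```python
# Hypothetical authority source mapping airport mentions to canonical codes.
AIRPORT_AUTHORITY = {
    "atlanta airport": "ATL",
    "hartsfield jackson": "ATL",
    "jfk airport": "JFK",
}

def resolve_airport(mention: str) -> str | None:
    """Return the canonical airport code for a text mention, or None if unresolved."""
    return AIRPORT_AUTHORITY.get(mention.lower())
```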

The entity resolution component 1850 may not be successful in resolving every entity and filling every slot represented in the N-best list represented in the N-best list data 1840. This may result in the entity resolution component 1850 outputting incomplete results. The NLU component 1554 may include a final ranker component 1860, which may consider such errors when determining how to rank the tagged textual interpretations for potential execution. For example, if a book speechlet recognizer 1720 outputs a tagged textual interpretation including a <ReadBook> intent flag, but the entity resolution component 1850 cannot find a book with a title matching the text of the item, the final ranker component 1860 may re-score that particular tagged textual interpretation to be given a lower score. The final ranker component 1860 may also assign a particular confidence to each tagged textual interpretation input therein. The confidence score of a particular tagged textual interpretation may be affected by whether the tagged textual interpretation has unfilled slots. For example, if a tagged textual interpretation associated with a first speechlet includes slots that are all filled/resolved, that tagged textual interpretation may be associated with a higher confidence than another tagged textual interpretation including at least some slots that are unfilled/unresolved.

The final ranker component 1860 may apply re-scoring, biasing, or other techniques to obtain the most preferred tagged and resolved textual interpretation. To do so, the final ranker component 1860 may consider not only the data output by the entity resolution component 1850, but may also consider other data 1870. The other data 1870 may include a variety of information. For example, the other data 1870 may include speechlet rating or popularity data. For example, if one speechlet has a particularly high rating, the final ranker component 1860 may increase the score of a textual interpretation(s) associated with or otherwise invoking that particular speechlet. The other data 1870 may also include information about speechlets that have been specifically enabled by the user. For example, the final ranker component 1860 may assign higher scores to textual interpretations associated with or otherwise invoking enabled speechlets than textual interpretations associated with or otherwise invoking non-enabled speechlets. User history may also be considered, such as if the user regularly uses a particular speechlet or does so at particular times of day. Date, time, location, weather, type of device 1510, user ID, context, and other information may also be considered. For example, the final ranker component 1860 may consider when any particular speechlets are currently active (e.g., music being played, a game being played, etc.).
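
A sketch of that re-scoring, assuming each interpretation is a dict carrying a base score, a speechlet name, and an unfilled-slot flag; the weighting factors are illustrative, not tuned values.

```python
def rescore(interpretation: dict, enabled_speechlets: set[str]) -> float:
    """Penalize interpretations with unfilled slots and boost enabled speechlets."""
    score = interpretation["score"]
    if interpretation.get("unfilled_slots"):
        score *= 0.8                     # incomplete entity resolution lowers confidence
    if interpretation.get("speechlet") in enabled_speechlets:
        score *= 1.1                     # user-enabled speechlets are preferred
    return score

# Example: a fully resolved music interpretation outranks an unresolved book one.
candidates = [
    {"score": 0.70, "speechlet": "music", "unfilled_slots": False},
    {"score": 0.75, "speechlet": "books", "unfilled_slots": True},
]
ranked = sorted(candidates, key=lambda c: rescore(c, {"music"}), reverse=True)
```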

Following final ranking, the NLU component 1554 may output NLU output data 1880. The NLU component 1554 may send the NLU output data 1880 to the orchestrator component 1570, which sends the NLU output data 1880 to an appropriate speechlet 1590 or skill server(s) 1540 (e.g., one configured to execute a command based on the textual interpretation represented in the NLU output data 1880). The NLU output data 1880 may include an indicator of the intent of the textual interpretation along with data associated with the intent, for example an indication that the intent is <PlayMusic> and the music to be played is “Adele.” Multiple instances of NLU output data (e.g., 1880a-1880n) may be output for a given set of text data input into the NLU component 1554.

The speechlet(s) 1590/skill server(s) 1540 provides the server(s) 1520 with data responsive to the NLU output data 1880 received thereby. If the data is text data that needs to be converted to computerized speech, the orchestrator component 1570 sends the text data to the TTS component 1592.

User recognition data 1580 may also be used by the NLU component 1554 and/or the speechlet 1590/skill server(s) 1540 to ensure that any user-specific commands are properly interpreted and executed.

A user identified using techniques described herein may be associated with a user identifier (ID), user profile, or other information known about the user by the system. As part of the user recognition techniques described herein, the system may determine the user identifier, user profile, or other such information. The profile storage 1572 may include data corresponding to profiles that may be used by the system to perform speech processing. Such profiles may include a user profile that links various data about a user such as user preferences, user owned devices, address information, contacts, enabled speechlets, payment information, etc. Each user profile may be associated with a different user ID. A profile may be an umbrella profile specific to a group of users. That is, a group profile may encompass two or more individual user profiles, each associated with a unique respective user ID. For example, a profile may be a household profile that encompasses user profiles associated with multiple users of a single household. A profile may include preferences shared by all the user profiles encompassed thereby. Each user profile encompassed under a single group profile may include preferences specific to the user associated therewith. That is, each user profile may include preferences unique with respect to one or more other user profiles encompassed by the same group profile. A user profile may be a stand-alone profile or may be encompassed under a group profile.

A profile may also be a device profile corresponding to information about a particular device, for example a device ID, location, owner entity, whether the device is in a public, semi-public, or private location (which may be indicated by a public and/or semi-public flag), device capabilities, device hardware, or the like.

A profile may also be an entity profile, for example belonging to a business, organization, or other non-user entity. Such an entity profile may include information that may otherwise be found in a user and/or device profile, except that such information is associated with the entity. The entity profile may include information regarding which users and/or devices are associated with the entity.

For example, as illustrated in FIG. 19, a group profile 1900 may include information about users, devices, and locations of the devices. In the example illustrated, the group profile 1900 is associated with a home and lists four devices: one device in a living room, one device in a kitchen, one device in a den/office, and one device in a bedroom. Various other information may also be stored and/or associated with a profile.

Program module(s), applications, or the like disclosed herein may include one or more software components including, for example, software objects, methods, data structures, or the like. Each such software component may include computer-executable instructions that, responsive to execution, cause at least a portion of the functionality described herein (e.g., one or more operations of the illustrative methods described herein) to be performed.

A software component may be coded in any of a variety of programming languages. An illustrative programming language may be a lower-level programming language such as an assembly language associated with a particular hardware architecture and/or operating system platform. A software component comprising assembly language instructions may require conversion into executable machine code by an assembler prior to execution by the hardware architecture and/or platform.

Another example programming language may be a higher-level programming language that may be portable across multiple architectures. A software component comprising higher-level programming language instructions may require conversion to an intermediate representation by an interpreter or a compiler prior to execution.

Other examples of programming languages include, but are not limited to, a macro language, a shell or command language, a job control language, a script language, a database query or search language, or a report writing language. In one or more example embodiments, a software component comprising instructions in one of the foregoing examples of programming languages may be executed directly by an operating system or other software component without having to be first transformed into another form.

A software component may be stored as a file or other data storage construct. Software components of a similar type or functionally related may be stored together such as, for example, in a particular directory, folder, or library. Software components may be static (e.g., pre-established or fixed) or dynamic (e.g., created or modified at the time of execution).

Software components may invoke or be invoked by other software components through any of a wide variety of mechanisms. Invoked or invoking software components may comprise other custom-developed application software, operating system functionality (e.g., device drivers, data storage (e.g., file management) routines, other common routines and services, etc.), or third-party software components (e.g., middleware, encryption, or other security software, database management software, file transfer or other network communication software, mathematical or statistical software, image processing software, and format translation software).

Software components associated with a particular solution or system may reside and be executed on a single platform or may be distributed across multiple platforms. The multiple platforms may be associated with more than one hardware vendor, underlying chip technology, or operating system. Furthermore, software components associated with a particular solution or system may be initially written in one or more programming languages, but may invoke software components written in another programming language.

Computer-executable program instructions may be loaded onto a special-purpose computer or other particular machine, a processor, or other programmable data processing apparatus to produce a particular machine, such that execution of the instructions on the computer, processor, or other programmable data processing apparatus causes one or more functions or operations specified in the flow diagrams to be performed. These computer program instructions may also be stored in a computer-readable storage medium (CRSM) that upon execution may direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement one or more functions or operations specified in the flow diagrams. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational elements or steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process.

Additional types of CRSM that may be present in any of the devices described herein may include, but are not limited to, programmable random access memory (PRAM), SRAM, DRAM, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the information and which can be accessed. Combinations of any of the above are also included within the scope of CRSM. Alternatively, computer-readable communication media (CRCM) may include computer-readable instructions, program module(s), or other data transmitted within a data signal, such as a carrier wave, or other transmission. However, as used herein, CRSM does not include CRCM.

Although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the disclosure is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the embodiments. Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments could include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment.

1. (canceled)
 2. A method comprising: receiving, by at least one of a first device or a second device, first audio data representing an utterance, wherein the first device is communicatively coupled to the second device, and wherein the first device is in a locked state when the first audio data is received; sending the first audio data for voice processing to determine a meaning of the utterance; receiving a first command to display information associated with an application on the first device; and presenting, based on the first device being communicatively coupled to the second device, the information in a graphical interface without requesting user authentication at the first device.
 3. The method of claim 2, further comprising: receiving, by the first device, second audio data to access the information; determining that the first device is in the locked state and that the first device lacks a connection to the second device; and preventing access to the information based on the second audio data until user authentication is received or the first device is communicatively coupled to the second device.
 4. The method of claim 2, further comprising determining the first device is physically proximate to the second device based on the first device being connected to a wireless connection and the second device being connected to the wireless connection.
 5. The method of claim 4, wherein the wireless connection comprises at least one of: a Bluetooth connection or WiFi connection.
 6. The method of claim 2, further comprising determining that the first audio data comprises a wake word.
 7. The method of claim 2, wherein: the application on the first device comprises a calendar application; the information comprises event information for an event associated with a calendar application; and the method further comprises enabling touch-based functionality of the calendar application based on the user authentication being received.
 8. The method of claim 2, wherein the information is presented in the graphical interface of the second device.
 9. A first device comprising: memory that stores computer-executable instructions; and at least one processor configured to access the memory and execute the computer-executable instructions to: receive first audio data representing an utterance, wherein the first device is communicatively coupled to a second device, and wherein the first device is in a locked state when the first audio data is received; send the first audio data for voice processing to determine a meaning of the utterance; receive a first command to display information associated with an application on the first device; and present, based on the first device being communicatively coupled to the second device, the information in a graphical interface without requesting user authentication at the first device.
 10. The first device of claim 9, wherein the at least one processor is further configured to: receive second audio data to access the information; determine that the first device is in the locked state and that the first device lacks a connection to the second device; and prevent access to the information based on the second audio data until user authentication is received or the first device is communicatively coupled to the second device.
 11. The first device of claim 10, wherein the at least one processor is further configured to determine the first device is physically proximate to the second device based on the first device being connected to a wireless connection and the second device being connected to the wireless connection.
 12. The first device of claim 11, wherein the wireless connection comprises at least one of: a Bluetooth connection or WiFi connection.
 13. The first device of claim 9, wherein the at least one processor is further configured to access the memory and execute the computer-executable instructions to determine that the first audio data comprises a wake word.
 14. The first device of claim 9, wherein: the application on the first device comprises a calendar application; the information comprises event information for an event associated with a calendar application; and the at least one processor is further configured to enable touch-based functionality of the calendar application based on the user authentication being received.
 15. A non-transitory computer-readable storage medium storing computer-executable instructions that, as a result of being executed by one or more processors of a first device, cause the first device to: receive first audio data representing an utterance, wherein the first device is communicatively coupled to a second device, and wherein the first device is in a locked state when the first audio data is received; send the first audio data for voice processing to determine a meaning of the utterance; receive a first command to display information associated with an application on the first device; and present, based on the first device being communicatively coupled to the second device, the information in a graphical interface without requesting user authentication at the first device.
 16. The non-transitory computer-readable storage medium of claim 15, wherein the instructions, as a result of being executed by the one or more processors of the first device, further cause the first device to: receive second audio data to access the information; determine that the first device is in the locked state and that the first device lacks a connection to the second device; and prevent access to the information based on the second audio data until the user authentication is received or the first device is communicatively coupled to the second device.
 17. The non-transitory computer-readable storage medium of claim 16, wherein the instructions, as a result of being executed by the one or more processors of the first device, further cause the first device to determine the first device is physically proximate to the second device based on the first device being connected to a wireless connection and the second device being connected to the wireless connection.
 18. The non-transitory computer-readable storage medium of claim 17, wherein the wireless connection comprises at least one of: a Bluetooth connection or WiFi connection.
 19. The non-transitory computer-readable storage medium of claim 15, wherein the instructions, as a result of being executed by the one or more processors of the first device, further cause the first device to determine that the first audio data comprises a wake word.
 20. The non-transitory computer-readable storage medium of claim 15, wherein: the application comprises a calendar application; the information comprises event information for an event associated with a calendar application; and the instructions, as a result of being executed by the one or more processors of the first device, further cause the first device to enable touch-based functionality of the calendar application based on the user authentication being received.
 21. The non-transitory computer-readable storage medium of claim 15, wherein: the application on the first device comprises a calendar application; the information comprises event information for an event associated with a calendar application; and the instructions, as a result of being executed by the one or more processors of the first device, further cause the first device to enable touch-based functionality of the calendar application based on the user authentication being received.
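
The following non-limiting sketch illustrates one way the selective-authentication flow recited in claims 2 and 3 could be expressed in code; the function, object, and attribute names are hypothetical and are not part of the claims or the disclosure.

    # Hypothetical, non-limiting sketch of the selective-authentication flow
    # recited in claims 2 and 3; all names are illustrative.
    def handle_utterance(first_device, audio_data, remote_system):
        # Send the audio data for voice processing to determine the meaning of the utterance.
        command = remote_system.process_voice(audio_data)
        if command.action != "display_information":
            return

        if first_device.is_locked and first_device.is_coupled_to_second_device:
            # Locked but communicatively coupled to the second device:
            # present the information without requesting user authentication.
            first_device.present(command.information)
        elif first_device.is_locked:
            # Locked and not coupled: withhold access until user authentication
            # is received or the coupling to the second device is established.
            first_device.prompt_for_authentication()
        else:
            # Unlocked: present the information directly.
            first_device.present(command.information)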